Empowering Analysts: Building Data Pipelines with YAML, dlt, dbt, and Trino – A Step-by-Step Guide

<h2>Overview</h2>
<p>Data pipelines have traditionally been the domain of software engineers wielding PySpark or Python scripts. However, a newer stack — <strong>dlt</strong> (data load tool), <strong>dbt</strong> (data build tool), and <strong>Trino</strong> — lets analysts build and maintain pipelines using little more than YAML configuration files and SQL. This guide walks you through replacing a complex PySpark pipeline with four YAML files, cutting delivery time from weeks to a single day. By the end, you'll understand how to set up a pipeline that extracts, loads, transforms, and queries data without writing a single line of Python or Spark code.</p>

<h2>Prerequisites</h2>
<p>Before diving in, ensure you have:</p>
<ul>
<li><strong>Basic familiarity with SQL</strong> – dbt relies on SQL for transformations.</li>
<li><strong>Access to a running Trino cluster</strong> and an underlying data store it can query (e.g., Postgres, BigQuery, or object storage) – Trino will serve as the query engine.</li>
<li><strong>Python 3.8+</strong> installed (only for installing dlt and dbt; no coding required beyond setup).</li>
<li><strong>A YAML editor</strong> – any text editor works.</li>
<li><strong>A source of data</strong> – an API, database, or flat files you want to ingest.</li>
</ul>
<p>This guide assumes you are comfortable running terminal commands and editing configuration files.</p>

<h2>Step-by-Step Instructions</h2>
<h3>1. Setting Up the Tools</h3>
<p>Install dlt, dbt (with its Trino adapter), and the Trino client library using pip (or conda):</p>
<pre><code>pip install dlt dbt-core dbt-trino trino</code></pre>
<p>Verify the installations:</p>
<pre><code>dlt --version
dbt --version</code></pre>
<p>Create a project directory:</p>
<pre><code>mkdir my_pipeline
cd my_pipeline</code></pre>

<h3>2. Configuring the Source – dlt YAML</h3>
<p>dlt extracts data from sources and loads it into a destination. Create a file <code>sources.yml</code>:</p>
<pre><code># sources.yml
sources:
  my_api:
    type: rest_api
    config:
      base_url: "https://api.example.com/v1"
      endpoint: /data
      pagination: true
      # Add authentication if needed
      auth:
        api_key: "${API_KEY}"
</code></pre>
<p>This YAML tells dlt to fetch data from an API endpoint with pagination. Replace the URL and API key with your own. dlt supports many source types (databases, cloud storage, etc.).</p>

<h3>3. Loading Data – dlt Destination YAML</h3>
<p>Create <code>destinations.yml</code> to specify where the data goes:</p>
<pre><code># destinations.yml
destinations:
  my_trino:
    type: trino
    config:
      host: localhost
      port: 8080
      database: my_db
      user: analyst
      password: "${TRINO_PASSWORD}"
</code></pre>
<p>Now define a pipeline in <code>pipeline.yml</code> that links the source and destination:</p>
<pre><code># pipeline.yml
pipeline:
  name: my_first_pipeline
  source: my_api
  destination: my_trino
  tables:
    - name: raw_data
      primary_key: id
      incremental: true
</code></pre>
<p>Run the pipeline with a single command:</p>
<pre><code>dlt pipeline run pipeline.yml</code></pre>
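<p>The <code>${API_KEY}</code> and <code>${TRINO_PASSWORD}</code> placeholders in the YAML files are resolved from environment variables at run time, so export them in your shell before the run. A minimal sketch with placeholder values (substitute your own credentials):</p>
<pre><code># Export the secrets referenced in sources.yml and destinations.yml,
# then run the pipeline. The values below are placeholders.
export API_KEY="your-api-key"
export TRINO_PASSWORD="your-trino-password"

dlt pipeline run pipeline.yml
</code></pre>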
<p>Data is now loaded into Trino under the <code>raw_data</code> table.</p>

<h3>4. Transforming with dbt</h3>
<p>dbt lets analysts write transformations as SQL models. Initialize a dbt project inside your directory:</p>
<pre><code>dbt init my_dbt_project</code></pre>
<p>Edit <code>profiles.yml</code> to point to your Trino instance:</p>
<pre><code># profiles.yml
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: trino
      method: none
      host: localhost
      port: 8080
      database: my_db
      schema: analytics
      user: analyst
      password: "${TRINO_PASSWORD}"
</code></pre>
<p>Create a transformation model in <code>models/</code> – for example, <code>aggregated_data.sql</code>:</p>
<pre><code>-- models/aggregated_data.sql
SELECT
    EXTRACT(YEAR FROM event_date) AS year,
    EXTRACT(MONTH FROM event_date) AS month,
    category,
    SUM(revenue) AS total_revenue
FROM {{ source('raw_data', 'raw_data') }}
GROUP BY 1, 2, 3
</code></pre>
<p>Run dbt to apply the transformations:</p>
<pre><code>dbt run</code></pre>
<p>This creates a table or view in Trino's <code>analytics</code> schema.</p>

<h3>5. Querying with Trino</h3>
<p>Now you can query the transformed data using any SQL client connected to Trino. For example:</p>
<pre><code>-- Query from the Trino CLI or your BI tool
SELECT *
FROM my_db.analytics.aggregated_data
WHERE total_revenue > 100000
ORDER BY year, month;
</code></pre>
<p>That's it – a complete pipeline defined in four YAML files (<code>sources.yml</code>, <code>destinations.yml</code>, <code>pipeline.yml</code>, and dbt's <code>profiles.yml</code>) plus one SQL model.</p>

<h2 id="common-mistakes">Common Mistakes</h2>
<ul>
<li><strong>Incorrect indentation in YAML</strong> – YAML is space-sensitive. Use 2 spaces per level, not tabs.</li>
<li><strong>Missing environment variables</strong> – Never hardcode secrets; use <code>${VAR}</code> and export the variables before running.</li>
<li><strong>Pagination not enabled</strong> – dlt defaults to single-page fetches. If your API returns many records, enable <code>pagination: true</code> or specify a cursor.</li>
<li><strong>Database schema issues</strong> – Ensure the destination schema exists in Trino before running the dlt pipeline. dlt may create it automatically, but not always.</li>
<li><strong>Trino user permissions</strong> – The user must have write access to the destination schema and read access to any sources.</li>
<li><strong>dbt model referencing the wrong source</strong> – Verify that the source name in <code>source()</code> matches the table dlt loaded (see the source-definition sketch after this list). Use <code>dbt docs generate</code> to check lineage.</li>
<li><strong>Ignoring incremental loading</strong> – Without <code>incremental: true</code> in <code>pipeline.yml</code>, dlt will overwrite the entire table on every run.</li>
</ul>
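<p>The source-mismatch pitfall above deserves a concrete illustration: the <code>{{ source('raw_data', 'raw_data') }}</code> reference in <code>aggregated_data.sql</code> only resolves if the dbt project declares the table dlt loaded as a source. A minimal sketch of such a declaration is shown below; the file name and the schema are assumptions, so adjust them to match what you actually see in Trino:</p>
<pre><code># models/sources.yml (hypothetical file name)
version: 2

sources:
  - name: raw_data        # first argument to source()
    database: my_db       # Trino catalog from destinations.yml
    schema: raw_data      # assumed schema created by dlt – verify in Trino
    tables:
      - name: raw_data    # second argument to source()
</code></pre>
<p>With a declaration like this in place, <code>dbt run</code> can resolve the reference and <code>dbt docs generate</code> will show the lineage from the dlt-loaded table to <code>aggregated_data</code>.</p>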
<h2 id="summary">Summary</h2>
<p>By replacing PySpark with a stack of dlt, dbt, and Trino, organizations empower analysts to build and maintain data pipelines using YAML and SQL alone. The approach reduces delivery time from weeks to one day, removes the need for dedicated engineering support, and keeps pipelines version-controlled and auditable. This guide demonstrated a complete end-to-end pipeline with four configuration files and one SQL model, covering extraction, loading, transformation, and querying. Start with a single use case, and scale from there.</p>