Codabra

Your first DAG: extract → load → dbt → publish → notify

The shape every analytics pipeline ends up with, and the four properties that make it survive contact with production.

The 90-day backfill that doubled every metric

A team I knew needed to backfill 90 days of revenue history. They cleared the destination table, set catchup=True, and re-ran the DAG. Each daily run inserted that day's revenue. Looked clean.

Next morning the metrics dashboard showed every day's revenue doubled. Cause: the DAG had kept running on its regular schedule throughout the backfill, so each day's revenue was inserted twice: once by the catchup run for the historic date, and once by the scheduled run, which recomputed the same data and inserted it again because the load was non-idempotent.

The fix was a unique_key on the destination, plus turning off the schedule during backfills. The lesson: Airflow runs your code more often than you expect, and your code must be ready for it.
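The unique-key fix can be sketched with SQLite's upsert syntax (a minimal sketch; table and column names here are illustrative, and Postgres accepts the same INSERT ... ON CONFLICT shape):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_revenue (
        day TEXT NOT NULL,
        country TEXT NOT NULL,
        revenue_cents INTEGER NOT NULL,
        PRIMARY KEY (day, country)   -- the unique key that makes reruns safe
    )
""")

def load(day, country, revenue_cents):
    # Re-running this for the same (day, country) overwrites instead of duplicating.
    conn.execute(
        """
        INSERT INTO daily_revenue (day, country, revenue_cents)
        VALUES (?, ?, ?)
        ON CONFLICT (day, country) DO UPDATE SET revenue_cents = excluded.revenue_cents
        """,
        (day, country, revenue_cents),
    )

load("2026-01-01", "US", 1000)            # scheduled run
load("2026-01-01", "US", 1000)            # catchup run repeats it: no double count
total = conn.execute("SELECT SUM(revenue_cents) FROM daily_revenue").fetchone()[0]
print(total)  # 1000, not 2000
```

With a TRUNCATE-free upsert like this, pausing the schedule during backfills becomes a belt-and-suspenders measure rather than the only thing preventing double counting.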

The shape: extract → load → dbt → publish → notify

Five stages, six tasks — dbt_test is the gate between build and publish. Read it left-to-right: extract pulls yesterday's data, load_staging upserts it into bronze, dbt_run builds the silver and gold models, dbt_test gates promotion, publish flips the BI-facing views, notify summarizes the run. The same six tasks below, in Python — shape first, code second.

Skeleton of a daily analytics DAG with retries and idempotent tasks
from airflow.decorators import dag, task
from datetime import datetime, timedelta

@dag(
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,                       # set True only with idempotent loads
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "execution_timeout": timedelta(minutes=30),
    },
    tags=["analytics", "daily"],
)
def daily_marts():
    @task
    def extract(ds=None):
        # Pull yesterday's data from the source. ds is the logical date.
        ...

    @task
    def load_staging(ds=None):
        # UPSERT into staging — idempotent under retry.
        ...

    @task
    def dbt_run():
        # Triggers dbt CLI; dbt's own state makes incremental models idempotent.
        ...

    @task
    def dbt_test():
        # Fail the DAG if any test fails — the gate before publication.
        ...

    @task
    def publish():
        # Promote marts views to public schema; analyst-visible.
        ...

    @task
    def notify():
        # Slack/email summary of what ran and what didn't.
        ...

    extract() >> load_staging() >> dbt_run() >> dbt_test() >> publish() >> notify()

daily_marts()

Four properties of a DAG that survives production

  1. Idempotent tasks — running task N twice produces the same result. Without this, retries duplicate data.
  2. Bounded retries — retries=2 with exponential backoff, not retries=99. Infinite retries hide root causes.
  3. execution_timeout — every task has a wall-clock cap. A stuck task should fail loudly, not silently delay the rest of the DAG.
  4. Dependencies, not side-effects — task B reading from task A's output goes through Airflow's XCom (small) or a known table (large). Don't rely on "task B always runs on the same machine as task A" — most Airflow executors don't guarantee that.
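Properties 2 and 3 amount to "fail loudly after a bounded budget". Stripped of Airflow, the same policy looks like this (a plain-Python sketch; in Airflow the equivalent knobs are retries, retry_delay, retry_exponential_backoff, and execution_timeout):

```python
import time

def run_with_retries(task, retries=2, base_delay=1.0, sleep=time.sleep):
    """Run task(), retrying up to `retries` times with exponential backoff.

    Once the budget is spent, re-raise: a bounded policy surfaces the root
    cause instead of hiding it behind retries=99.
    """
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise                         # fail loudly, don't retry forever
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# A task that fails twice and then succeeds: two retries are enough.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky, retries=2, sleep=lambda s: None))  # ok
```

The point of the cap is diagnostic, not cosmetic: with retries=2 a genuinely broken task fails within minutes and pages someone, instead of spinning quietly until the downstream dashboard is stale.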

Build a *backfill-safe* INSERT pattern as a single SELECT. You receive `(day, country, revenue_cents)` from extract. The destination already has rows; you must produce the resulting row set after an idempotent merge. Use `WITH dest(...) AS (VALUES ...)` and `WITH source(...) AS (VALUES ...)` to simulate. Return the merged set with `(day, country, revenue_cents)`, ordered by day, country.
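One possible shape for that single SELECT, runnable through sqlite3 (a sketch with made-up values; the merge rule assumed here is that a source row replaces the dest row sharing its (day, country) key):

```python
import sqlite3

MERGE_SQL = """
WITH dest(day, country, revenue_cents) AS (
    VALUES ('2026-01-01', 'DE', 500),
           ('2026-01-01', 'US', 2000)      -- stale row the backfill corrects
),
source(day, country, revenue_cents) AS (
    VALUES ('2026-01-01', 'US', 1000),
           ('2026-01-02', 'US', 1500)
)
-- Source wins on key collision; dest rows survive only when no source row
-- shares their (day, country) key. Union of the two sets is the merge result.
SELECT day, country, revenue_cents FROM source
UNION ALL
SELECT d.day, d.country, d.revenue_cents
FROM dest AS d
WHERE NOT EXISTS (
    SELECT 1 FROM source AS s
    WHERE s.day = d.day AND s.country = d.country
)
ORDER BY day, country
"""

rows = sqlite3.connect(":memory:").execute(MERGE_SQL).fetchall()
for row in rows:
    print(row)
# ('2026-01-01', 'DE', 500)
# ('2026-01-01', 'US', 1000)
# ('2026-01-02', 'US', 1500)
```

Because the whole merge is one SELECT over two VALUES CTEs, it is trivially re-runnable: executing it twice over the same inputs yields the same row set, which is exactly the idempotence the backfill story demands.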

Takeaway: Airflow is the orchestrator, not the logic. Tasks call out to dbt/SQL/Spark and are idempotent under retry. Backfill is safe because the load is INSERT ... ON CONFLICT, not TRUNCATE + INSERT. Four properties — idempotent, bounded retries, timeouts, explicit dependencies — make the difference between a DAG that runs and a DAG you trust.