Idempotent UPSERT and the load that you can safely run twice
How to write an INSERT that survives retries, partial failures, and the inevitable 'rerun the pipeline' from your on-call self at 3 AM.
Why idempotency matters
Every production pipeline retries. A transient network error, a Kubernetes pod restart, a Friday-afternoon human typing dbt run again "to be safe" — your load step will run more than once on the same input. Idempotent means: running it N times has the same effect as running it once.
A non-idempotent INSERT loaded twice is a duplicated row. A duplicated row is a doubled metric. A doubled metric is the morning the CEO calls finance. The fix is mechanical: every insert into a long-lived table must be upsert-shaped.
INSERT INTO marts.daily_revenue (day, country, revenue_cents)
VALUES (:day, :country, :revenue_cents)
ON CONFLICT (day, country) -- requires a UNIQUE constraint on (day, country)
DO UPDATE SET
revenue_cents = EXCLUDED.revenue_cents,
loaded_at = now();
-- EXCLUDED is a virtual table holding the row that *would have been* inserted.
-- For a "do nothing on duplicate" pattern (e.g. event ingestion where the source emits
-- the same event multiple times):
INSERT INTO oltp.events (event_id, ...)
VALUES (...)
ON CONFLICT (event_id) DO NOTHING;

ON CONFLICT requires a unique constraint. No constraint = no conflict detection = the INSERT just inserts. The constraint is what makes the operation safe under retry; the SQL is just the convenience.
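One refinement worth knowing: DO UPDATE accepts a WHERE clause, so a retry that carries identical data can skip the write entirely instead of churning out dead tuples. A sketch against the daily_revenue table from above:

```sql
-- Only rewrite the row when the incoming value actually differs.
-- IS DISTINCT FROM handles NULLs sanely (NULL vs NULL counts as "not distinct").
INSERT INTO marts.daily_revenue (day, country, revenue_cents)
VALUES (:day, :country, :revenue_cents)
ON CONFLICT (day, country)
DO UPDATE SET
    revenue_cents = EXCLUDED.revenue_cents,
    loaded_at     = now()
WHERE daily_revenue.revenue_cents IS DISTINCT FROM EXCLUDED.revenue_cents;
```

Inside the DO UPDATE clause, the bare table name refers to the existing row and EXCLUDED to the incoming one, so the WHERE compares old against new.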
A common bug: writing ON CONFLICT (day, country) without first having UNIQUE (day, country). Postgres errors with "there is no unique or exclusion constraint matching the ON CONFLICT specification". Add the constraint first, or use ON CONFLICT ON CONSTRAINT my_constraint_name.
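Concretely, the fix is one ALTER TABLE. The constraint name below is illustrative — use whatever naming convention your schema already follows:

```sql
-- Give the table the unique constraint that ON CONFLICT will arbitrate on.
ALTER TABLE marts.daily_revenue
    ADD CONSTRAINT daily_revenue_day_country_key UNIQUE (day, country);

-- Equivalent upsert, naming the constraint instead of the column list:
INSERT INTO marts.daily_revenue (day, country, revenue_cents)
VALUES (:day, :country, :revenue_cents)
ON CONFLICT ON CONSTRAINT daily_revenue_day_country_key
DO UPDATE SET revenue_cents = EXCLUDED.revenue_cents;
```

If the table already has duplicate (day, country) pairs, the ALTER TABLE will fail — deduplicate first, then add the constraint.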
SQL standard MERGE — when ON CONFLICT isn't enough
Postgres 15 added MERGE INTO ... USING ... ON ... WHEN MATCHED ... WHEN NOT MATCHED, the SQL-standard upsert. More verbose than ON CONFLICT, but it lets you express:
- insert this batch of rows; for matching keys, update; for keys present in the target but not the source, soft-delete. (That last arm needs WHEN NOT MATCHED BY SOURCE, which arrived in Postgres 17; on 15 and 16 you handle target-only rows with a second statement.)
For the common case (insert-or-update one row), ON CONFLICT is shorter and faster. For full batch reconciliation (the dbt incremental pattern, the MERGE lakehouse pattern), MERGE is closer to what you mean.
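A sketch of the batch-reconciliation shape in Postgres 15 syntax. The staging table name is illustrative; the commented-out soft-delete arm requires Postgres 17:

```sql
MERGE INTO marts.daily_revenue AS t
USING staging.daily_revenue AS s
    ON t.day = s.day AND t.country = s.country
WHEN MATCHED THEN
    UPDATE SET revenue_cents = s.revenue_cents,
               loaded_at     = now()
WHEN NOT MATCHED THEN
    INSERT (day, country, revenue_cents)
    VALUES (s.day, s.country, s.revenue_cents);
-- Postgres 17+ only, assuming a nullable deleted_at audit column exists:
-- WHEN NOT MATCHED BY SOURCE THEN
--     UPDATE SET deleted_at = now();
```

Like ON CONFLICT, this is idempotent under retry: a second run with the same staging data matches every row and updates it to the values it already has.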
Build a tiny upsert demo. You can't create a `daily_revenue (day date, country text, revenue_cents bigint, PRIMARY KEY (day, country))` table inside a single query (`WITH ... INSERT` won't exercise the conflict path), so instead use a `WITH t(...) AS (VALUES ...)` CTE to *simulate* the result of an upsert that deduplicates keys. Return one row per (day, country) with the latest revenue, ordered by day, country.
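One possible answer, using DISTINCT ON to keep the last write per key. The literal values (and the load_seq column standing in for "which run wrote this") are made up for the demo:

```sql
-- Simulate "last write wins" dedup over a batch containing a retry.
WITH t(day, country, revenue_cents, load_seq) AS (
    VALUES
        (DATE '2024-01-01', 'DE',  80, 1),
        (DATE '2024-01-01', 'US', 100, 1),  -- first load
        (DATE '2024-01-01', 'US', 120, 2)   -- retry with a corrected figure
)
SELECT DISTINCT ON (day, country) day, country, revenue_cents
FROM t
ORDER BY day, country, load_seq DESC;
```

DISTINCT ON keeps the first row per (day, country) in ORDER BY order, so sorting load_seq descending within each key yields the latest write — here the US row comes out as 120, not 100.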
Takeaway: every load step into a long-lived table is INSERT ... ON CONFLICT DO UPDATE/NOTHING, and the key columns it conflicts on are backed by a UNIQUE constraint (or primary key). With those two pieces in place, the same load can run a thousand times and produce the same result. That's idempotency, and it's the property that makes 3 AM less bad.
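A quick self-check for any load you believe is idempotent, sketched here against daily_revenue — compare the business columns only, since audit columns like loaded_at legitimately change on every run:

```sql
-- Snapshot, rerun the load, then diff in both directions.
CREATE TEMP TABLE before_rerun AS
    SELECT day, country, revenue_cents FROM marts.daily_revenue;

-- ... rerun the load here ...

SELECT count(*) AS drift FROM (
    (SELECT day, country, revenue_cents FROM marts.daily_revenue
     EXCEPT SELECT * FROM before_rerun)
    UNION ALL
    (SELECT * FROM before_rerun
     EXCEPT SELECT day, country, revenue_cents FROM marts.daily_revenue)
) d;
-- drift = 0 means the rerun was a no-op, i.e. the load is idempotent on this input.
```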