Codabra

Watermarks, overlap windows, and the missed updates that ate a Tuesday

Why `WHERE updated_at > max(updated_at)` is wrong, and the safer pattern every incremental load needs.

The Tuesday that lost 412 updates

A team I knew ran an incremental load:

INSERT INTO marts.fct_orders
SELECT * FROM oltp.orders
WHERE updated_at > (SELECT MAX(updated_at) FROM marts.fct_orders);

Reasonable shape. Wrong on a busy database.

Monday's high watermark was 2025-03-10 23:59:58.913. Tuesday's load picked up everything strictly after that. But four hundred updates on Monday had been assigned updated_at of 2025-03-10 23:59:58.913 (down to the millisecond), then written a few seconds later because the database was under load. Some of those writes happened during Tuesday's load — strictly after the high watermark moment but with the same timestamp. They were never picked up.

Finance noticed when month-end totals didn't match. The fix: read the last N minutes/hours on every run, not just strictly newer. UPSERT into the destination so the overlap doesn't duplicate. Watermark + overlap.

Watermark + overlap, on a time line

The story's real bug: under load the source assigns updated_at before the row hits disk, so two writes can share an updated_at equal to the previous run's high watermark.

        last run high watermark               this run starts
                  │                                  │
  ────────────────●──────────────────────────────────●────────────►  wall-clock
            11:59:58.913                          14:00

  OVERLAP window (1 hr): re-read updated_at > 10:59:58.913

  Two source rows, both with updated_at = 11:59:58.913:
    row #A   written 11:59:59.001  → caught by previous run
    row #B   written 12:00:00.700  → writer was still in flight at last commit

  NAIVE    WHERE updated_at > 11:59:58.913      → row #B missed (NOT >)  ☢
  OVERLAP  WHERE updated_at > 10:59:58.913      → row #B caught
           + INSERT … ON CONFLICT (id) DO UPDATE  → row #A re-read = no-op
                                                  → idempotent  ✓

Size the overlap to absorb expected lateness: a streaming source with a 10-min producer→ingest delay needs ~30 min of overlap; a batched daily source with same-day-late corrections needs ≥ 24 hours. The UPSERT keys are what make the overlap free — re-reading already-ingested rows costs nothing because the destination doesn't change.

The safe pattern: watermark + overlap + UPSERT
-- Pattern: re-read the last hour, UPSERT, advance the watermark.
INSERT INTO marts.fct_orders (order_id, customer_id, status, total_cents, updated_at, loaded_at)
SELECT order_id, customer_id, status, total_cents, updated_at, now()
FROM oltp.orders
WHERE updated_at > (SELECT MAX(updated_at) FROM marts.fct_orders) - INTERVAL '1 hour'
ON CONFLICT (order_id) DO UPDATE SET
  status      = EXCLUDED.status,
  total_cents = EXCLUDED.total_cents,
  updated_at  = EXCLUDED.updated_at,
  loaded_at   = now();

-- The 1-hour overlap covers timestamp-tied writes and short clock skews.
-- The UPSERT means re-pulling the same row is a no-op (idempotent).
-- The watermark advances naturally with each run.

CDC vs watermark — pick the right tool.

  • Watermark + overlap — works against any source with an updated_at. Simple, slightly behind real-time, doesn't capture deletes unless the source soft-deletes (sets deleted_at).
  • CDC (Change Data Capture) — streams every INSERT/UPDATE/DELETE from the source's WAL/binlog. Real-time, captures deletes, requires more infra (Debezium / Fivetran HVR / native Postgres logical replication). The right answer when you need <1 minute latency or hard-delete tracking.

For analytics, watermarking covers 90% of cases and is much cheaper to operate. CDC is needed when the consumer (operational read replica, ML feature store) demands it.

Reproduce the watermark bug. Inline two `orders` updates with the same `updated_at` (the timestamp tie). One was written before the watermark, one after. A naive `> max` query picks up zero rows. Show that. Return one row, one column `naive_picks_up_late_writes`, with value `false` (it doesn't).

Takeaway: pure > max(updated_at) is wrong on any source that has clock skew, contention, or millisecond-tied writes. The pattern is watermark + overlap + UPSERT: re-pull the last N minutes on every run; UPSERT makes it idempotent. CDC when you need real-time and delete tracking. Reconciliation when you want to know the numbers are right, not hope.