Watermarks, overlap windows, and the missed updates that ate a Tuesday
Why `WHERE updated_at > max(updated_at)` is wrong, and the safer pattern every incremental load needs.
The Tuesday that lost 412 updates
A team I knew ran an incremental load:
INSERT INTO marts.fct_orders
SELECT * FROM oltp.orders
WHERE updated_at > (SELECT MAX(updated_at) FROM marts.fct_orders);
Reasonable shape. Wrong on a busy database.
Monday's high watermark was 2025-03-10 23:59:58.913. Tuesday's load picked up everything strictly after that. But four hundred updates on Monday had been assigned updated_at of 2025-03-10 23:59:58.913 (down to the millisecond), then written a few seconds later because the database was under load. Some of those writes happened during Tuesday's load — strictly after the high watermark moment but with the same timestamp. They were never picked up.
Finance noticed when month-end totals didn't match. The fix: read the last N minutes/hours on every run, not just strictly newer. UPSERT into the destination so the overlap doesn't duplicate. Watermark + overlap.
Watermark + overlap, on a time line
The story's real bug: under load the source assigns updated_at before the row hits disk, so two writes can share an updated_at equal to the previous run's high watermark.
last run high watermark this run starts
│ │
────────────────●──────────────────────────────────●────────────► wall-clock
11:59:58.913 14:00
OVERLAP window (1 hr): re-read updated_at > 10:59:58.913
Two source rows, both with updated_at = 11:59:58.913:
row #A written 11:59:59.001 → caught by previous run
row #B written 12:00:00.700 → writer was still in flight at last commit
NAIVE WHERE updated_at > 11:59:58.913 → row #B missed (NOT >) ☢
OVERLAP WHERE updated_at > 10:59:58.913 → row #B caught
+ INSERT … ON CONFLICT (id) DO UPDATE → row #A re-read = no-op
→ idempotent ✓
Size the overlap to absorb expected lateness: a streaming source with a 10-min producer→ingest delay needs ~30 min of overlap; a batched daily source with same-day-late corrections needs ≥ 24 hours. The UPSERT keys are what make the overlap free — re-reading already-ingested rows costs nothing because the destination doesn't change.
-- Pattern: re-read the last hour, UPSERT, advance the watermark.
INSERT INTO marts.fct_orders (order_id, customer_id, status, total_cents, updated_at, loaded_at)
SELECT order_id, customer_id, status, total_cents, updated_at, now()
FROM oltp.orders
WHERE updated_at > (SELECT MAX(updated_at) FROM marts.fct_orders) - INTERVAL '1 hour'
ON CONFLICT (order_id) DO UPDATE SET
status = EXCLUDED.status,
total_cents = EXCLUDED.total_cents,
updated_at = EXCLUDED.updated_at,
loaded_at = now();
-- The 1-hour overlap covers timestamp-tied writes and short clock skews.
-- The UPSERT means re-pulling the same row is a no-op (idempotent).
-- The watermark advances naturally with each run.CDC vs watermark — pick the right tool.
- Watermark + overlap — works against any source with an
updated_at. Simple, slightly behind real-time, doesn't capture deletes unless the source soft-deletes (setsdeleted_at). - CDC (Change Data Capture) — streams every INSERT/UPDATE/DELETE from the source's WAL/binlog. Real-time, captures deletes, requires more infra (Debezium / Fivetran HVR / native Postgres logical replication). The right answer when you need <1 minute latency or hard-delete tracking.
For analytics, watermarking covers 90% of cases and is much cheaper to operate. CDC is needed when the consumer (operational read replica, ML feature store) demands it.
Reproduce the watermark bug. Inline two `orders` updates with the same `updated_at` (the timestamp tie). One was written before the watermark, one after. A naive `> max` query picks up zero rows. Show that. Return one row, one column `naive_picks_up_late_writes`, with value `false` (it doesn't).
Takeaway: pure > max(updated_at) is wrong on any source that has clock skew, contention, or millisecond-tied writes. The pattern is watermark + overlap + UPSERT: re-pull the last N minutes on every run; UPSERT makes it idempotent. CDC when you need real-time and delete tracking. Reconciliation when you want to know the numbers are right, not hope.