Codabra

Online migrations: adding NOT NULL to a 200M-row table without taking the site down

What every long-lived table should carry, and the four-step migration playbook that turns a 30-minute table-rewriting lock into a series of millisecond operations.

The deploy that took a 200M-row table offline at peak

A team I knew shipped a migration on a Friday at 4 PM (already a mistake):

ALTER TABLE orders
  ADD COLUMN refund_reason text NOT NULL DEFAULT '';

In Postgres ≤10 this rewrites the entire table — 200M rows, ~80 GB, with an ACCESS EXCLUSIVE lock that blocks every reader and writer. The migration ran for 47 minutes. During that time, every checkout hit the lock and timed out. The site was down for forty-seven minutes on a Friday afternoon.

Postgres 11+ made ADD COLUMN ... DEFAULT cheap only when the default is a constant; expressions like now() still rewrite. And SET NOT NULL still scans every row to validate. The playbook below works on every Postgres version and is the safe default.
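Concretely, each statement below behaves differently (`flagged_at` and `notes` are illustrative names, not part of the example schema):

```sql
-- Postgres 11+: metadata-only, because the default is a constant
ALTER TABLE orders ADD COLUMN refund_reason text NOT NULL DEFAULT '';

-- Rewrites the whole table on every version: now() is volatile
ALTER TABLE orders ADD COLUMN flagged_at timestamptz NOT NULL DEFAULT now();

-- Scans every row to validate, on any pre-existing nullable column
ALTER TABLE orders ALTER COLUMN notes SET NOT NULL;
```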

Online migration playbook — adding a NOT NULL column with no extended lock:

  1. ADD COLUMN col text NULL — fast metadata-only operation. Nullable, no default.
  2. Backfill in batches: UPDATE ... WHERE col IS NULL AND id BETWEEN $1 AND $2 in chunks of 10k–100k rows, with a sleep between batches so autovacuum can catch up.
  3. Application writes the new column — deploy code that always sets col on INSERT/UPDATE.
  4. Validate then promote: ALTER TABLE ... ADD CONSTRAINT col_not_null CHECK (col IS NOT NULL) NOT VALID (instant); ALTER TABLE ... VALIDATE CONSTRAINT col_not_null (full row scan, but it doesn't block reads or writes); finally SET NOT NULL, which is instant on Postgres 12+ because the validated constraint already proves it (on older versions, skip this step and keep the CHECK constraint instead).

Four steps (the last one in three quick sub-operations), zero exclusive locks longer than a millisecond. Every senior DBA knows this dance; it's not optional on a hot OLTP table.
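Put together for the refund_reason example, the whole playbook is a short script (batch bounds and the constraint name are illustrative):

```sql
-- Step 1: nullable column, metadata-only
ALTER TABLE orders ADD COLUMN refund_reason text NULL;

-- Step 2: backfill in chunks; repeat with advancing bounds, sleeping between runs
UPDATE orders
   SET refund_reason = ''
 WHERE refund_reason IS NULL
   AND id BETWEEN 1 AND 100000;

-- Step 3: deploy application code that always writes refund_reason

-- Step 4: prove the invariant, then promote it
ALTER TABLE orders
  ADD CONSTRAINT refund_reason_not_null
  CHECK (refund_reason IS NOT NULL) NOT VALID;        -- instant
ALTER TABLE orders
  VALIDATE CONSTRAINT refund_reason_not_null;         -- scans rows, doesn't block writes
ALTER TABLE orders
  ALTER COLUMN refund_reason SET NOT NULL;            -- Postgres 12+: instant, constraint proves it
ALTER TABLE orders
  DROP CONSTRAINT refund_reason_not_null;             -- optional cleanup
```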

47 minutes vs five sub-second locks

NAIVE  one ALTER, ACCESS EXCLUSIVE held the entire time:

  t=0 │█████████████████████████████████████████████████│ t=47min   ☢ outage

ONLINE  same column, five short steps; no exclusive lock >1 ms:

  ┌─────────┬──────────────────────┬──────────┬─────────────────┬──────────┐
  │ step 1  │ step 2 (batched)     │ step 3   │ step 4 (online) │ step 5   │
  ├─────────┼──────────────────────┼──────────┼─────────────────┼──────────┤
  │ ADD COL │ UPDATE by id range   │ ADD      │ VALIDATE        │ SET NOT  │
  │ NULL    │   + sleep × N        │ CHECK    │ CONSTRAINT      │ NULL     │
  │         │                      │ NOT      │ (full row scan, │          │
  │         │                      │ VALID    │  no lock)       │          │
  ├─────────┼──────────────────────┼──────────┼─────────────────┼──────────┤
  │ <1 ms   │ minutes – hours      │ <1 ms    │ long, online    │ <1 ms    │
  │ no lock │ short locks per chunk│ brief    │ no exclusive    │ metadata │
  └─────────┴──────────────────────┴──────────┴─────────────────┴──────────┘

Five operations, none holding an exclusive lock for more than a millisecond. Total wall-clock is longer than the naive ALTER (you trickle the backfill instead of blocking the world), but the site stays up. That's the trade you make every time.
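Step 2 is the only step that needs real driver code. A minimal Python sketch of the batched backfill (the `execute_sql` hook, chunk size, and pause are assumptions, not a fixed API):

```python
import time

def batch_ranges(min_id, max_id, size):
    """Yield inclusive (lo, hi) id windows covering [min_id, max_id]."""
    lo = min_id
    while lo <= max_id:
        yield lo, min(lo + size - 1, max_id)
        lo += size

def backfill(execute_sql, min_id, max_id, size=50_000, pause=0.5):
    """Run the step-2 backfill in chunks, pausing between batches.

    execute_sql is whatever runs a parameterized statement against your
    database (e.g. a thin wrapper over a psycopg cursor.execute).
    """
    for lo, hi in batch_ranges(min_id, max_id, size):
        execute_sql(
            "UPDATE orders SET refund_reason = ''"
            " WHERE refund_reason IS NULL AND id BETWEEN %s AND %s",
            (lo, hi),
        )
        time.sleep(pause)  # breathing room for autovacuum and replicas
```

Keying batches on the primary key range (rather than `LIMIT`, which Postgres UPDATE doesn't support) keeps each chunk's lock footprint small and the scan index-driven.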

The audit-column starter kit every long-lived table deserves
-- Shown as one ALTER for brevity; on an existing hot table, add each column
-- via the online playbook above (the now() defaults below rewrite the table).
ALTER TABLE orders
  ADD COLUMN created_at     timestamptz NOT NULL DEFAULT now(),  -- when the row was first written
  ADD COLUMN updated_at     timestamptz NOT NULL DEFAULT now(),  -- last write
  ADD COLUMN loaded_at      timestamptz NOT NULL DEFAULT now(),  -- when the warehouse ingested it
  ADD COLUMN source_system  text         NOT NULL DEFAULT 'oltp', -- where it came from
  ADD COLUMN batch_id       text;                                  -- ingestion batch identifier (nullable)

-- Trigger to keep updated_at honest:
CREATE OR REPLACE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_touch_updated_at
  BEFORE UPDATE ON orders
  FOR EACH ROW EXECUTE FUNCTION touch_updated_at();

Why every column above earns its keep

  • created_at — answers "when was this row born?". Every cohort, retention, funnel and incident report depends on it.
  • updated_at — the watermark for incremental loads. Without it, every dbt incremental model has to be rebuilt from scratch every night.
  • loaded_at — when the warehouse saw this row. Disagrees with updated_at when the source is delayed; the difference is your ingestion latency, the most under-monitored SLO in data engineering.
  • source_system — when you eventually merge data from two systems (M&A, region split), you'll know which row came from where without writing a WHERE clause based on column values.
  • batch_id — when a load fails halfway, you need to delete just this batch's rows and re-run. Without batch_id, your only options are full reload or surgical SQL archaeology.
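Two of these columns pay off directly in day-two operations. A sketch of both queries (the batch_id value is a made-up example):

```sql
-- Ingestion latency: how far the warehouse lags the source, per hour
SELECT date_trunc('hour', loaded_at) AS hour,
       avg(loaded_at - updated_at)   AS avg_lag,
       max(loaded_at - updated_at)   AS worst_lag
  FROM orders
 GROUP BY 1
 ORDER BY 1 DESC;

-- Recovering from a half-failed load: remove only that batch, then re-run it
DELETE FROM orders WHERE batch_id = '2024-06-01-run-7';
```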


Takeaway: schemas drift, but they shouldn't drift at the cost of an outage. Audit columns are the cheap insurance every long-lived table needs. Online migrations are a four-step ritual you do every time, even on "small" tables, because small and busy are different things.