Online migrations: adding NOT NULL to a 200M-row table without taking the site down
What every long-lived table should carry, and the four-step migration playbook that turns a 30-minute table-rewriting lock into a series of millisecond operations.
The deploy that took a 200M-row table offline at peak
A team I knew shipped a migration on a Friday at 4 PM (already a mistake):
ALTER TABLE orders
ADD COLUMN refund_reason text NOT NULL DEFAULT '';
In Postgres ≤10 this rewrites the entire table — 200M rows, ~80 GB, with an ACCESS EXCLUSIVE lock that blocks every reader and writer. The migration ran for 47 minutes. During that time, every checkout hit the lock and timed out. The site was down for forty-seven minutes on a Friday afternoon.
Postgres 11+ made ADD COLUMN ... DEFAULT cheap when the default is non-volatile: a constant (or a stable expression like now()) is evaluated once and stored as metadata. Volatile defaults such as clock_timestamp() or random() still rewrite the table. And SET NOT NULL still takes an ACCESS EXCLUSIVE lock while it scans every row to validate. The playbook below works on every Postgres version and is the safe default.
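A quick way to see the difference (column names are illustrative; behavior described is Postgres 11+):

```sql
-- Constant default: evaluated once, stored as metadata — no rewrite
ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'new';

-- Volatile default: must be evaluated per row — full table rewrite
-- under an ACCESS EXCLUSIVE lock
ALTER TABLE orders ADD COLUMN seen_at timestamptz
  NOT NULL DEFAULT clock_timestamp();
```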
Online migration playbook — adding a NOT NULL column with no extended lock:
1. Add the column nullable — ALTER TABLE ... ADD COLUMN col text NULL is a fast, metadata-only operation. Nullable, no default.
2. Backfill in batches — UPDATE ... WHERE col IS NULL AND id BETWEEN $1 AND $2 in chunks of 10k–100k rows, with a sleep between batches so autovacuum can catch up.
3. Application writes the new column — deploy code that always sets col on INSERT/UPDATE.
4. Validate, then promote — ALTER TABLE ... ADD CONSTRAINT col_not_null CHECK (col IS NOT NULL) NOT VALID (instant); ALTER TABLE ... VALIDATE CONSTRAINT col_not_null (online row scan, no exclusive lock); finally SET NOT NULL is now instant because the constraint already proved it (Postgres 12+ recognizes the constraint; on older versions, simply keep the CHECK constraint instead).
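Applied to the refund_reason column from the incident above, the playbook looks roughly like this (constraint name, id range, and the final cleanup are illustrative choices, not prescriptions):

```sql
-- Step 1: nullable column, metadata-only, instant
ALTER TABLE orders ADD COLUMN refund_reason text;

-- Step 2: backfill in batches — run repeatedly from a script,
-- one transaction per batch, sleeping between batches;
-- advance the id range on each run
UPDATE orders
SET    refund_reason = ''
WHERE  refund_reason IS NULL
AND    id BETWEEN 1 AND 100000;

-- Step 3: deploy application code that always sets refund_reason.

-- Step 4: prove it, then promote it
ALTER TABLE orders
  ADD CONSTRAINT refund_reason_not_null
  CHECK (refund_reason IS NOT NULL) NOT VALID;  -- instant

ALTER TABLE orders
  VALIDATE CONSTRAINT refund_reason_not_null;   -- online scan, no exclusive lock

ALTER TABLE orders
  ALTER COLUMN refund_reason SET NOT NULL;      -- instant on Postgres 12+

ALTER TABLE orders
  DROP CONSTRAINT refund_reason_not_null;       -- optional cleanup
```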
Four steps, zero exclusive locks longer than a millisecond. Every senior DBA knows this dance; it's not optional on a hot OLTP table.
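Step 2 is the fiddly one. On Postgres 11+ the batch loop can even live server-side, because procedures may COMMIT between iterations — each chunk releases its locks before the next begins. A sketch, assuming Postgres 11+; the procedure name, batch size, and sleep interval are illustrative:

```sql
CREATE PROCEDURE backfill_refund_reason(batch_size int DEFAULT 50000)
LANGUAGE plpgsql AS $$
DECLARE
  rows_done bigint;
BEGIN
  LOOP
    -- touch only rows that still need the backfill
    UPDATE orders
    SET    refund_reason = ''
    WHERE  id IN (SELECT id
                  FROM   orders
                  WHERE  refund_reason IS NULL
                  LIMIT  batch_size);
    GET DIAGNOSTICS rows_done = ROW_COUNT;
    EXIT WHEN rows_done = 0;
    COMMIT;                 -- release locks, let autovacuum catch up
    PERFORM pg_sleep(0.5);  -- throttle
  END LOOP;
END $$;

CALL backfill_refund_reason();
```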
47 minutes vs five sub-second locks
NAIVE: one ALTER, ACCESS EXCLUSIVE held the entire time:
t=0 │█████████████████████████████████████████████████│ t=47min ☢ outage
ONLINE: same column, five short steps; no exclusive lock >1 ms:
┌─────────┬──────────────────────┬──────────┬─────────────────┬──────────┐
│ step 1 │ step 2 (batched) │ step 3 │ step 4 (online) │ step 5 │
├─────────┼──────────────────────┼──────────┼─────────────────┼──────────┤
│ ADD COL │ UPDATE … LIMIT k │ ADD │ VALIDATE │ SET NOT │
│ NULL │ + sleep × N │ CHECK │ CONSTRAINT │ NULL │
│ │ │ NOT │ (full row scan, │ │
│ │ │ VALID │ no lock) │ │
├─────────┼──────────────────────┼──────────┼─────────────────┼──────────┤
│ <1 ms │ minutes – hours │ <1 ms │ long, online │ <1 ms │
│ no lock │ short locks per chunk│ brief │ no exclusive │ metadata │
└─────────┴──────────────────────┴──────────┴─────────────────┴──────────┘
Five operations, none holding an exclusive lock for more than a millisecond. Total wall-clock is longer than the naive ALTER (you trickle the backfill instead of blocking the world), but the site stays up. That's the trade you make every time.
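One refinement worth pairing with every step: even an instant ALTER has to queue behind any long-running transaction that holds a conflicting lock — and every other query then queues behind the waiting ALTER. A lock_timeout turns that pile-up into a fast, retryable failure (the 2-second value is an illustrative choice):

```sql
-- If the ALTER cannot get its lock quickly, fail fast and retry later
-- instead of stalling every query behind it
SET lock_timeout = '2s';
ALTER TABLE orders ADD COLUMN refund_reason text;
```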
What every long-lived table should carry
ALTER TABLE orders
ADD COLUMN created_at timestamptz NOT NULL DEFAULT now(), -- when the row was first written
ADD COLUMN updated_at timestamptz NOT NULL DEFAULT now(), -- last write
ADD COLUMN loaded_at timestamptz NOT NULL DEFAULT now(), -- when the warehouse ingested it
ADD COLUMN source_system text NOT NULL DEFAULT 'oltp', -- where it came from
ADD COLUMN batch_id text; -- ingestion batch identifier (nullable)
-- Trigger to keep updated_at honest:
CREATE OR REPLACE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
NEW.updated_at := now();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER orders_touch_updated_at
BEFORE UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION touch_updated_at();
Why every column above earns its keep
- `created_at` — answers "when was this row born?" Every cohort, retention, funnel and incident report depends on it.
- `updated_at` — the watermark for incremental loads. Without it, every dbt incremental model has to be rebuilt from scratch every night.
- `loaded_at` — when the warehouse saw this row. Disagrees with `updated_at` when the source is delayed; the difference is your ingestion latency, the most under-monitored SLO in data engineering.
- `source_system` — when you eventually merge data from two systems (M&A, region split), you'll know which row came from where without writing a `WHERE` clause based on column values.
- `batch_id` — when a load fails halfway, you need to delete just this batch's rows and re-run. Without `batch_id`, your only options are a full reload or surgical SQL archaeology.
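A sketch of how these columns pay off downstream — the warehouse table name, watermark value, and batch identifier are all illustrative:

```sql
-- Incremental extract: only rows changed since the last successful load,
-- using updated_at as the watermark
SELECT *
FROM   orders
WHERE  updated_at > '2024-06-01 00:00:00+00';

-- A load failed halfway? Delete exactly that batch's rows and re-run:
DELETE FROM orders_warehouse
WHERE  batch_id = '2024-06-02-run-17';

-- Monitor ingestion latency: the gap between loaded_at and updated_at
SELECT max(loaded_at - updated_at) AS worst_ingest_lag
FROM   orders_warehouse
WHERE  loaded_at > now() - interval '1 day';
```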
Takeaway: schemas drift, but they shouldn't drift at the cost of an outage. Audit columns are the cheap insurance every long-lived table needs. Online migrations are a four-step ritual you do every time, even on "small" tables, because small and busy are different things.