Codabra

Slowly Changing Dimensions: SCD2 — keeping history without overwriting it

How to answer 'what plan was Ada on when she placed this order?' months after she switched. Effective dates, the trick that lets dimensions remember the past.

The CFO question that broke the dimension table

A finance team asked: "What was the plan distribution of customers who placed orders last March?". The DE wrote the obvious join:

SELECT c.plan, COUNT(o.order_id)
FROM dim_customer c
JOIN fact_orders o USING (customer_key)
WHERE o.date_key BETWEEN '2025-03-01' AND '2025-03-31'
GROUP BY c.plan;

Numbers came back. CFO said "these don't match what I remember from March". The DE checked: yes, they matched the current plan distribution, not the March one. Half the customers had upgraded since March. The dimension stored only the current state.

The fix is SCD Type 2: keep one row per version of the dimension, with effective_from and effective_to. Joins become as-of — match the order to the dimension version that was current at the time the order happened.

SCD2 dim_customer — one row per version
CREATE TABLE marts.dim_customer (
  customer_key   bigserial PRIMARY KEY,           -- surrogate, one per version
  customer_id    bigint NOT NULL,                 -- natural key, repeats across versions
  name           text NOT NULL,
  country        text,
  plan           text NOT NULL,
  effective_from timestamptz NOT NULL,
  effective_to   timestamptz,                     -- NULL = current row
  is_current     boolean GENERATED ALWAYS AS (effective_to IS NULL) STORED,
  loaded_at      timestamptz NOT NULL DEFAULT now()
);

-- One UNIQUE "current" row per customer:
CREATE UNIQUE INDEX dim_customer_current
  ON marts.dim_customer (customer_id) WHERE is_current;

SCD2 history, on a number line

  customer_id = 1 (Ada Lovelace)

  customer_key = 17                customer_key = 42
  ├──── plan = 'free' ─────┤       ├──── plan = 'pro' ────►
  2024-01-05         2025-03-15            2025-03-15   (∞, open)
        ●                                       ●
   event #A: 2024-09-12              event #B: 2025-04-22
   → joins to key 17, plan='free'    → joins to key 42, plan='pro'

Storage in dim_customer:

 customer_key │ customer_id │ plan │ effective_from │ effective_to │ is_current
      17      │      1      │ free │  2024-01-05    │  2025-03-15  │   false
      42      │      1      │ pro  │  2025-03-15    │  NULL (= ∞)  │   true

customer_key is the version surrogate (one per row); customer_id is the natural business key (repeats across versions). When a fact row is written at the time of the event, it stores customer_key directly — no range condition at read time, just USING (customer_key). When a fact row joins retroactively, you need the range condition (event_at >= effective_from AND event_at < COALESCE(effective_to, …)).

The as-of join.

-- *"Plan distribution at the time of each order"* — joins to the historically-correct version:
SELECT c.plan, COUNT(o.order_id)
FROM fact_orders o
JOIN dim_customer c ON c.customer_id = o.customer_id
  AND o.created_at >= c.effective_from
  AND o.created_at <  COALESCE(c.effective_to, 'infinity'::timestamptz)
WHERE o.date_key BETWEEN '2025-03-01' AND '2025-03-31'
GROUP BY c.plan;

Better: store customer_key (the version surrogate) in fact_orders at write time, so the join is a single USING (customer_key) — no range condition needed at read time.

Build a tiny SCD2 lookup. Use `WITH dim(customer_id, plan, effective_from, effective_to) AS (VALUES ...)` for two customers (1 and 2), where customer 1 has two versions ('free' until 2025-03-15, then 'pro') and customer 2 has one version ('pro' since 2025-01-01). Join an inline `events` table with two events — `(1, '2025-03-01')` and `(1, '2025-04-01')` — to the historically-correct plan. Return `event_id`, `customer_id`, `plan`, `event_at` ordered by `event_id`. Expected: event 1 → 'free', event 2 → 'pro'.

Takeaway: SCD2 is one row per dimension version, with effective_from and effective_to. Range-based join in queries, or precomputed customer_key at write time. Dimensions that overwrite history make the CFO mad. Dimensions that remember don't.