Codabra

Star schema in one lesson — and why dashboards run 100× faster on it

The geometry of analytics: one fact in the middle, many dimensions around it. Why this beats both 'one giant flat table' and 'fully normalized OLTP source'.

"Why is the dashboard so slow?" — the modeling answer

A BI team I knew built dashboards directly on the OLTP schema. Every chart joined seven tables to translate IDs into names. The dashboard took 12 seconds to load. The DBA spent a week tuning indexes and refreshing optimizer statistics. Nothing helped.

The issue was not the queries. It was the schema. The OLTP schema is normalized for writes — minimum duplication, every fact in one place. Dashboards do reads — wide aggregations across many rows, repeated thousands of times a day. The two workloads want opposite shapes.

Moving to a star schema (one fact_orders table carrying dimension keys, plus small denormalized dim_customer, dim_product, and dim_date tables) dropped the dashboard load time from 12s to 80ms. Same data, same database, different shape.

Fact and dimension

  • A fact holds one row per business event with measurable numbers (revenue, count, duration, latency). Grows fast.
  • A dimension holds attributes for slicing (customer, product, date, country, plan). Grows slowly; ideally only when reality changes.

A star schema is one fact in the middle joined to many dimensions. Compared to the OLTP schema, it is denormalized on purpose — dashboards only need short joins on small dimensions. The denormalization buys read speed; the cost is write complexity (when a dimension attribute changes, you either overwrite the dimension row and lose history, or keep versions with SCD2 — next lesson).

The star, drawn

                dim_date
                   |
  dim_customer --- fact_orders --- dim_product

That is the geometry the word star refers to: one fact in the middle, dimensions hanging off the points. Every dashboard query is FROM fact JOIN dim_a JOIN dim_b GROUP BY dim_a.attribute, dim_b.attribute — short joins on small tables, no crawl through seven levels of normalized tables. The DDL below is the same picture, in code.

A minimal star: one fact, three dimensions
-- dimensions first, so the fact's foreign keys have something to reference

-- one row per (calendar) day — pre-built, rarely touched after load
CREATE TABLE marts.dim_date (
  date_key    date PRIMARY KEY,
  year        integer NOT NULL,
  quarter     integer NOT NULL,
  month       integer NOT NULL,
  day_of_week integer NOT NULL,
  is_weekend  boolean NOT NULL,
  is_holiday  boolean NOT NULL DEFAULT false
);

-- one row per customer (current state) — see SCD2 lesson for full history
CREATE TABLE marts.dim_customer (
  customer_key  bigint PRIMARY KEY,
  customer_id   bigint NOT NULL UNIQUE,
  name          text NOT NULL,
  country       text,
  plan          text,
  signup_date   date NOT NULL
);

-- one row per product (current state)
CREATE TABLE marts.dim_product (
  product_key  bigint PRIMARY KEY,
  product_id   bigint NOT NULL UNIQUE,
  name         text NOT NULL,
  category     text
);

-- one row per order (the grain — write it as a comment, always)
CREATE TABLE marts.fact_orders (
  order_pk      bigserial PRIMARY KEY,                                      -- surrogate key
  order_id      bigint NOT NULL UNIQUE,                                     -- source business key
  date_key      date    NOT NULL REFERENCES marts.dim_date(date_key),
  customer_key  bigint  NOT NULL REFERENCES marts.dim_customer(customer_key),
  product_key   bigint  NOT NULL REFERENCES marts.dim_product(product_key),
  status        text    NOT NULL,
  total_cents   integer NOT NULL,
  loaded_at     timestamptz NOT NULL DEFAULT now()
);
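The date dimension is the one table you can build once and forget. A sketch of how it might be populated in Postgres with generate_series (the date range is an arbitrary example):

```sql
-- Populate dim_date for an example range; ISO day-of-week: 1 = Monday … 7 = Sunday
INSERT INTO marts.dim_date (date_key, year, quarter, month, day_of_week, is_weekend)
SELECT
  d::date,
  EXTRACT(year    FROM d)::int,
  EXTRACT(quarter FROM d)::int,
  EXTRACT(month   FROM d)::int,
  EXTRACT(isodow  FROM d)::int,
  EXTRACT(isodow  FROM d) IN (6, 7)          -- Saturday or Sunday
FROM generate_series('2015-01-01'::date, '2035-12-31'::date, interval '1 day') AS d;
```

is_holiday keeps its default of false here; holiday calendars vary by business, so that column is typically patched in by a separate UPDATE.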
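And here is what a typical dashboard query looks like against this star — one fact, two short joins, one GROUP BY (the status filter value is an assumption about the source data):

```sql
-- Monthly revenue by country: the canonical star-schema query shape
SELECT
  dd.year,
  dd.month,
  dc.country,
  count(*)                   AS orders,
  sum(f.total_cents) / 100.0 AS revenue
FROM marts.fact_orders f
JOIN marts.dim_date     dd ON dd.date_key     = f.date_key
JOIN marts.dim_customer dc ON dc.customer_key = f.customer_key
WHERE f.status = 'completed'                  -- example status value
GROUP BY dd.year, dd.month, dc.country
ORDER BY dd.year, dd.month, revenue DESC;
```

No seven-table chain, no ID-to-name translation at query time — the dimensions already carry the names.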

Two anti-patterns to avoid

  1. One Big Table (OBT) — denormalize everything into one wide table with hundreds of columns. Reads are fast for the one query you optimized for; everything else is a slog. Schema migrations are painful, storage cost is high. Tempting, especially in column stores; resist.

  2. Mirror the OLTP schema in marts — same normalization, same join graph. Loses the read-speed advantage of the warehouse. Common when teams "just dump prod into BigQuery". Use bronze/silver/gold to layer it: bronze mirrors OLTP, silver normalizes/cleans, gold is the star schema.

Exercise

Pretend the `shop_basic` schema is the *source*. Write the SQL that would build a `fact_orders` row from each order with the (already-existing) `customer_id` plus a derived `month` and `is_weekend` flag — exactly the columns a dashboard would slice on. Return columns `order_id`, `customer_id`, `month` (date truncated to first of month), `is_weekend` (true if `created_at` is Saturday or Sunday), `total_cents`, ordered by `order_id`.
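One possible answer, assuming `shop_basic.orders` carries the columns the exercise names (`order_id`, `customer_id`, `created_at`, `total_cents`):

```sql
SELECT
  o.order_id,
  o.customer_id,
  date_trunc('month', o.created_at)::date       AS month,       -- first of month
  EXTRACT(isodow FROM o.created_at) IN (6, 7)   AS is_weekend,  -- Sat = 6, Sun = 7
  o.total_cents
FROM shop_basic.orders AS o
ORDER BY o.order_id;
```

Try it yourself before peeking — the point is to notice that the derived columns (`month`, `is_weekend`) are exactly what dim_date would hand you for free in a real star.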

Takeaway: a star schema is one fact + many small dimensions, denormalized for read speed. It's not the only model (Data Vault, OBT) but it's the one every BI tool understands and every analyst expects. Grain at the top, in a comment, always.