Star schema in one lesson — and why dashboards run 100× faster on it
The geometry of analytics: one fact in the middle, many dimensions around it. Why this beats both 'one giant flat table' and 'fully normalized OLTP source'.
"Why is the dashboard so slow?" — the modeling answer
A BI team I knew built dashboards directly on the OLTP schema. Every chart joined 7 tables just to translate IDs into names, and the dashboard took 12 seconds to load. The DBA spent a week tuning indexes and refreshing optimizer statistics. Nothing helped.
The issue was not the queries. It was the schema. The OLTP schema is normalized for writes — minimum duplication, every fact in one place. Dashboards do reads — wide aggregations across many rows, repeated thousands of times a day. The two workloads want opposite shapes.
Moving to a star schema (one wide fact_orders carrying the measures and dimension keys, plus small dim_customer, dim_product, and dim_date tables carrying the names and attributes) dropped the dashboard load time from 12s to 80ms. Same data, same database, different shape.
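To make "opposite shapes" concrete, the pre-migration chart queries looked roughly like this (table names are illustrative, not the actual system, which chained seven joins):
-- OLTP shape: walk the normalized join graph just to turn IDs into names
SELECT co.name AS country, sum(oi.quantity * p.price_cents) AS revenue_cents
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id
JOIN customers c ON c.id = o.customer_id
JOIN countries co ON co.id = c.country_id
GROUP BY co.name;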
Fact and dimension
- A fact holds one row per business event with measurable numbers (revenue, count, duration, latency). Grows fast.
- A dimension holds attributes for slicing (customer, product, date, country, plan). Grows slowly; ideally only when reality changes.
A star schema is one fact in the middle joined to many dimensions. Compared to the OLTP schema, it is denormalized on purpose: dashboards only need short joins on small dimensions. The denormalization buys read speed; the cost is write complexity (when a dimension attribute changes, you must decide whether to overwrite the dimension row and lose history, or keep versions with SCD2, covered in the next lesson).
The star, drawn
That is the geometry the word *star* refers to: one fact in the middle, dimensions hanging off the points. Every dashboard query is FROM fact JOIN dim_a JOIN dim_b GROUP BY dim_a.attribute, dim_b.attribute. Short joins on small tables, no chain through seven normalized levels. The DDL below is the same picture, in code.
-- grain: one row per order (write the grain as a comment, always)
-- note: create the dimension tables below first; the fact is shown first
-- here only to match the picture
CREATE TABLE marts.fact_orders (
    order_pk      bigserial PRIMARY KEY,   -- surrogate key
    order_id      bigint NOT NULL UNIQUE,  -- source business key
    date_key      date NOT NULL REFERENCES marts.dim_date (date_key),
    customer_key  bigint NOT NULL REFERENCES marts.dim_customer (customer_key),
    product_key   bigint NOT NULL REFERENCES marts.dim_product (product_key),
    status        text NOT NULL,
    total_cents   integer NOT NULL,
    loaded_at     timestamptz NOT NULL DEFAULT now()
);
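fact_orders also references a dim_product whose DDL the lesson does not spell out. A minimal sketch by analogy with the other dimensions (the non-key columns are illustrative):
-- grain: one row per product (current state); non-key columns are illustrative
CREATE TABLE marts.dim_product (
    product_key  bigint PRIMARY KEY,      -- surrogate key
    product_id   bigint NOT NULL UNIQUE,  -- source business key
    name         text NOT NULL,
    category     text
);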
-- grain: one calendar day; pre-built once, never updated
CREATE TABLE marts.dim_date (
    date_key     date PRIMARY KEY,
    year         integer NOT NULL,
    quarter      integer NOT NULL,
    month        integer NOT NULL,
    day_of_week  integer NOT NULL,
    is_weekend   boolean NOT NULL,
    is_holiday   boolean NOT NULL DEFAULT false
);
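Since dim_date is pre-built, it is typically generated once rather than loaded from a source. A sketch using generate_series, assuming ISO day numbering (1 = Monday through 7 = Sunday) and an arbitrary 2015-2035 horizon:
-- populate dim_date once; is_holiday stays at its default until a holiday
-- calendar is loaded separately
INSERT INTO marts.dim_date (date_key, year, quarter, month, day_of_week, is_weekend)
SELECT
    d::date,
    extract(year FROM d)::int,
    extract(quarter FROM d)::int,
    extract(month FROM d)::int,
    extract(isodow FROM d)::int,
    extract(isodow FROM d) IN (6, 7)
FROM generate_series(date '2015-01-01', date '2035-12-31', interval '1 day') AS d;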
-- grain: one row per customer, current state (see the SCD2 lesson for full history)
CREATE TABLE marts.dim_customer (
    customer_key  bigint PRIMARY KEY,      -- surrogate key
    customer_id   bigint NOT NULL UNIQUE,  -- source business key
    name          text NOT NULL,
    country       text,
    plan          text,
    signup_date   date NOT NULL
);
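Every chart then compiles to the same short shape. Monthly revenue by country, for example (the chart itself is hypothetical; the tables are the ones defined above):
-- two small joins, one GROUP BY: this is the whole query plan
SELECT
    d.year,
    d.month,
    c.country,
    sum(f.total_cents) / 100.0 AS revenue
FROM marts.fact_orders AS f
JOIN marts.dim_date AS d ON d.date_key = f.date_key
JOIN marts.dim_customer AS c ON c.customer_key = f.customer_key
GROUP BY d.year, d.month, c.country
ORDER BY d.year, d.month, c.country;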
Two anti-patterns to avoid
- One Big Table (OBT): denormalize everything into one wide table with hundreds of columns. Reads are fast for the one query you optimized for; everything else is a slog, schema migrations are painful, and storage cost is high. Tempting, especially in column stores; resist.
- Mirroring the OLTP schema in marts: same normalization, same join graph, so you lose the read-speed advantage the warehouse exists for. Common when teams "just dump prod into BigQuery". Layer it with bronze/silver/gold instead: bronze mirrors OLTP, silver cleans and types, gold is the star schema (a sketch follows below).
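A minimal sketch of that layering, with hypothetical bronze and silver object names (the gold layer is the marts star defined above):
-- silver: typed, cleaned, deduplicated view over a hypothetical raw bronze table
-- (bronze.raw_orders and its loaded_at column are assumptions for this sketch)
CREATE VIEW silver.orders AS
SELECT DISTINCT ON (order_id)
    order_id::bigint,
    customer_id::bigint,
    created_at::timestamptz,
    total_cents::integer
FROM bronze.raw_orders
WHERE order_id IS NOT NULL
ORDER BY order_id, loaded_at DESC;  -- keep the latest load per order
-- gold: load marts.fact_orders and the dim_* tables from silver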
Pretend the `shop_basic` schema is the *source*. Write the SQL that would build a `fact_orders` row from each order with the (already-existing) `customer_id` plus a derived `month` and `is_weekend` flag — exactly the columns a dashboard would slice on. Return columns `order_id`, `customer_id`, `month` (date truncated to first of month), `is_weekend` (true if `created_at` is Saturday or Sunday), `total_cents`, ordered by `order_id`.
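One possible answer, assuming the orders live in a shop_basic.orders table with order_id, customer_id, created_at, and total_cents columns (the table name is an assumption; the exercise only names the schema):
-- order_id, customer_id, month, is_weekend, total_cents, ordered by order_id
SELECT
    o.order_id,
    o.customer_id,
    date_trunc('month', o.created_at)::date AS month,
    extract(isodow FROM o.created_at) IN (6, 7) AS is_weekend,  -- 6 = Sat, 7 = Sun
    o.total_cents
FROM shop_basic.orders AS o
ORDER BY o.order_id;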
Takeaway: a star schema is one fact + many small dimensions, denormalized for read speed. It's not the only model (Data Vault, OBT) but it's the one every BI tool understands and every analyst expects. Grain at the top, in a comment, always.