Tables, keys and grain — or how I shipped a 30% revenue bug
What a row really represents, and why getting the grain right is the most important modeling decision you will make this year.
The 30% bug
My first month as a junior analyst, I shipped a dashboard called "Revenue by category". Marketing loved it. Sales sent it to the CEO. Two weeks later finance flagged the totals — they were 30% higher than the books.
The SQL was a single line:
SELECT category, SUM(price_cents) FROM order_items GROUP BY category;
Looks innocent. It is wrong.
order_items has one row per item — a single order with three items contributes three rows. So SUM(price_cents) summed the line items correctly, but finance was computing revenue from orders.total_cents which already included multi-item discounts and shipping. My number was the gross of items; theirs was the net to the company. Same word, different grain.
This lesson exists so that doesn't happen to you.
What is a table, really?
A table is a relation: a set of rows where every row has the same shape. Two ideas matter more than the syntax:
- The primary key identifies a row uniquely. No two rows have the same key, no key is NULL.
- The grain is the answer to what does one row mean? — for example one row per order, one row per (user, day, country), one row per order line item.
Grain is a property of the table, not the column names. A column called revenue on a daily_user_country_revenue table means one row per (user, day, country), not per dollar.
The schema you'll write against, in one picture
Grain in one sentence per table: CUSTOMERS — one row per customer; ORDERS — one row per order; PRODUCTS — one row per SKU; ORDER_ITEMS — one row per (order, product). The 30% revenue bug above came from confusing one row per item (order_items) with one row per order (orders). Pin the picture to the top of your monitor for Module 02 — every JOIN you write for the next four hours runs through these four tables.
The grain test. If you cannot describe a table's grain in one sentence — "one row per X" — the model is unclear and every metric on top is ambiguous. "Number of orders" on order_items (one row per item) is not the same as on orders (one row per order). Ask the grain. Always.
-- one row per customer
CREATE TABLE customers (
customer_id bigint PRIMARY KEY,
email text NOT NULL UNIQUE,
name text NOT NULL,
country text NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
-- one row per product (SKU is the natural key, product_id is the surrogate)
CREATE TABLE products (
product_id bigint PRIMARY KEY,
sku text NOT NULL UNIQUE,
title text NOT NULL,
category text NOT NULL,
price_cents integer NOT NULL CHECK (price_cents >= 0)
);
-- one row per order
CREATE TABLE orders (
order_id bigint PRIMARY KEY,
customer_id bigint NOT NULL REFERENCES customers(customer_id),
status text NOT NULL,
created_at timestamptz NOT NULL,
total_cents integer NOT NULL -- denormalized for read speed; recompute on change
);
-- one row per (order, product). Composite primary key.
CREATE TABLE order_items (
order_id bigint NOT NULL REFERENCES orders(order_id),
product_id bigint NOT NULL REFERENCES products(product_id),
quantity integer NOT NULL CHECK (quantity > 0),
price_cents integer NOT NULL,
PRIMARY KEY (order_id, product_id)
);Data type choices that bite later.
- Money in
floatordouble— the price of every horror story you've heard about "$0.01 differences in the books". Useintegercents ornumeric(12,2). Neverfloat. - Time without zone (
timestamp) — works until you ship to a second region. Usetimestamptz(which stores UTC and converts on read). varchar(50)— stored identically totextunder the hood; the length cap is enforced as a check constraint, not a storage win. The length is rarely useful and almost always becomes a future migration.
Find every customer and the number of paid orders they have made. Return columns `customer_id`, `name`, `paid_orders`, sorted by `paid_orders` descending then by `name` ascending. **Customers with zero paid orders should still appear with `0`** — the boring detail that makes this exercise real.
Takeaway: name the grain, write it in a comment above every CREATE TABLE, and most aggregation bugs disappear before you write the SQL.