← Deep SQL & Production Data Engineering

Tables, keys and grain — or how I shipped a 30% revenue bug

What a row really represents, and why getting the grain right is the most important modeling decision you will make this year.

The 30% bug

My first month as a junior analyst, I shipped a dashboard called "Revenue by category". Marketing loved it. Sales sent it to the CEO. Two weeks later finance flagged the totals — they were 30% higher than the books.

The SQL was a single line:

SELECT category, SUM(price_cents) FROM order_items GROUP BY category;

Looks innocent. It is wrong.

order_items has one row per item — a single order with three items contributes three rows. So SUM(price_cents) summed the line items correctly, but finance was computing revenue from orders.total_cents which already included multi-item discounts and shipping. My number was the gross of items; theirs was the net to the company. Same word, different grain.

This lesson exists so that doesn't happen to you.

What is a table, really?

A table is a relation: a set of rows where every row has the same shape. Two ideas matter more than the syntax:

The primary key identifies a row uniquely. No two rows have the same key, no key is NULL.
The grain is the answer to what does one row mean? — for example one row per order, one row per (user, day, country), one row per order line item.

Grain is a property of the table, not the column names. A column called revenue on a daily_user_country_revenue table means one row per (user, day, country), not per dollar.

The schema you'll write against, in one picture

Grain in one sentence per table: CUSTOMERS — one row per customer; ORDERS — one row per order; PRODUCTS — one row per SKU; ORDER_ITEMS — one row per (order, product). The 30% revenue bug above came from confusing one row per item (order_items) with one row per order (orders). Pin the picture to the top of your monitor for Module 02 — every JOIN you write for the next four hours runs through these four tables.

The grain test. If you cannot describe a table's grain in one sentence — "one row per X" — the model is unclear and every metric on top is ambiguous. "Number of orders" on order_items (one row per item) is not the same as on orders (one row per order). Ask the grain. Always.

A minimal e-commerce schema, with grain commented

-- one row per customer
CREATE TABLE customers (
  customer_id  bigint PRIMARY KEY,
  email        text NOT NULL UNIQUE,
  name         text NOT NULL,
  country      text NOT NULL,
  created_at   timestamptz NOT NULL DEFAULT now()
);

-- one row per product (SKU is the natural key, product_id is the surrogate)
CREATE TABLE products (
  product_id   bigint PRIMARY KEY,
  sku          text NOT NULL UNIQUE,
  title        text NOT NULL,
  category     text NOT NULL,
  price_cents  integer NOT NULL CHECK (price_cents >= 0)
);

-- one row per order
CREATE TABLE orders (
  order_id     bigint PRIMARY KEY,
  customer_id  bigint NOT NULL REFERENCES customers(customer_id),
  status       text NOT NULL,
  created_at   timestamptz NOT NULL,
  total_cents  integer NOT NULL  -- denormalized for read speed; recompute on change
);

-- one row per (order, product). Composite primary key.
CREATE TABLE order_items (
  order_id     bigint NOT NULL REFERENCES orders(order_id),
  product_id   bigint NOT NULL REFERENCES products(product_id),
  quantity     integer NOT NULL CHECK (quantity > 0),
  price_cents  integer NOT NULL,
  PRIMARY KEY (order_id, product_id)
);

Data type choices that bite later.

Money in float or double — the price of every horror story you've heard about "$0.01 differences in the books". Use integer cents or numeric(12,2). Never float.
Time without zone (timestamp) — works until you ship to a second region. Use timestamptz (which stores UTC and converts on read).
varchar(50) — stored identically to text under the hood; the length cap is enforced as a check constraint, not a storage win. The length is rarely useful and almost always becomes a future migration.

Find every customer and the number of paid orders they have made. Return columns `customer_id`, `name`, `paid_orders`, sorted by `paid_orders` descending then by `name` ascending. **Customers with zero paid orders should still appear with `0`** — the boring detail that makes this exercise real.

Takeaway: name the grain, write it in a comment above every CREATE TABLE, and most aggregation bugs disappear before you write the SQL.