The Black Friday +800% revenue bug — and the diagnostic playbook
The single most common silent bug in analytics. Reproduce it, then learn the four-step playbook that catches it before it ships.
+800% revenue, on paper
A team I worked with shipped a new revenue dashboard the week before Black Friday. The Tuesday after Black Friday, the CEO asked finance to confirm the +800% YoY growth on the dashboard. Finance, sensibly, refused.
The culprit query:

```sql
SELECT p.category, SUM(o.total_cents) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
JOIN products p ON p.product_id = oi.product_id
GROUP BY p.category;
```
It looks plausible. It over-counts by ~8×: the average order had 8 items, and each item row duplicated the order's total_cents. The sum was 8× the real revenue, distributed across whichever categories the items happened to be in.
This is the many-to-many fanout — and it's the single most common silent bug in analytics. Today we make it fail loudly.
What goes wrong, in pictures
One order, four line items. After the JOIN, total_cents is repeated four times:
```text
orders                        order_items
order_id │ total_cents        order_id │ product_id │ qty │ price
      42 │       9 900              42 │        101 │   1 │ 2 500
                                    42 │        102 │   2 │ 1 200
                                    42 │        103 │   1 │ 3 700
                                    42 │        104 │   1 │ 2 500

JOIN on order_id  → 4 rows (one per item), total_cents repeated 4×
SUM(total_cents)  → 39 600  ☢ 4× the real revenue
correct answer    → 9 900   (sum from orders directly, no JOIN)
```
The rule: the grain of the result equals the grain of the table whose join key has duplicates. The picture above is one order × four items; a million orders × eight items each is eight million rows, and your query's runtime quietly tells you so.
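The fanout above can be reproduced end-to-end in a few lines. A minimal sketch using Python's stdlib sqlite3, with the article's table and column names and the one-order/four-item data from the picture:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, total_cents INTEGER);
    CREATE TABLE order_items (order_id INTEGER, product_id INTEGER,
                              qty INTEGER, price INTEGER);
    INSERT INTO orders VALUES (42, 9900);
    INSERT INTO order_items VALUES
        (42, 101, 1, 2500), (42, 102, 2, 1200),
        (42, 103, 1, 3700), (42, 104, 1, 2500);
""")

# The buggy pattern: joining to items repeats total_cents once per item.
(fanned,) = con.execute("""
    SELECT SUM(o.total_cents)
    FROM orders o JOIN order_items oi ON oi.order_id = o.order_id
""").fetchone()

# The correct answer keeps the order grain: no JOIN at all.
(real,) = con.execute("SELECT SUM(total_cents) FROM orders").fetchone()

print(fanned, real)  # 39600 vs 9900: a 4x over-count
```

Every item row carries a full copy of the order's total, so the SUM scales with the item count, not the order count.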
Diagnostic playbook — run this on every JOIN
- State the grain of each input in a comment: `-- orders: one row per order`, `-- order_items: one row per (order, product)`.
- State the grain of the result before writing the JOIN: "After JOIN, one row per item."
- After running, `SELECT COUNT(*)` the result and compare to your prediction.
- If they differ, fix the model; do not slap on `DISTINCT`.
DISTINCT is the masking tape of analytics. It hides the bug, doesn't fix it, and falls off when the data shape changes.
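The count-and-compare step of the playbook is mechanical enough to script. A sketch in stdlib sqlite3 with the same toy schema (the `count` helper is mine, not a library function):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, total_cents INTEGER);
    CREATE TABLE order_items (order_id INTEGER, product_id INTEGER,
                              qty INTEGER, price INTEGER);
    INSERT INTO orders VALUES (42, 9900);
    INSERT INTO order_items VALUES
        (42, 101, 1, 2500), (42, 102, 2, 1200),
        (42, 103, 1, 3700), (42, 104, 1, 2500);
""")

def count(sql):
    """COUNT(*) of an arbitrary query, for grain checks."""
    return con.execute(f"SELECT COUNT(*) FROM ({sql})").fetchone()[0]

# Stated grain of the result: one row per item -> rows(order_items).
predicted = count("SELECT * FROM order_items")
actual = count("""
    SELECT * FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
""")
assert actual == predicted  # grain check passes: item grain, as stated
```

If the assertion fires, the JOIN produced a grain you did not predict, and the model needs fixing before any aggregate built on it can be trusted.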
A quick exercise: return the total revenue (in cents) from **paid** orders. Output: a single row, single column named `revenue_cents`. The wrong (and tempting) answer is `JOIN order_items ... SUM(price_cents * quantity)`. The right answer keeps the order grain — no JOIN at all.
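A sketch of that order-grain answer in sqlite3. The `status` column and its `'paid'` value are assumed schema details, not given in the article:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, status TEXT, total_cents INTEGER);
    INSERT INTO orders VALUES
        (42, 'paid', 9900), (43, 'cancelled', 500), (44, 'paid', 100);
""")

# Order grain in, order grain out: filter and sum, no JOIN anywhere.
(revenue_cents,) = con.execute("""
    SELECT SUM(total_cents) AS revenue_cents
    FROM orders
    WHERE status = 'paid'  -- 'status'/'paid' are assumed schema details
""").fetchone()
print(revenue_cents)  # 10000
```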
You write `SELECT COUNT(*) FROM orders o JOIN order_items oi ON oi.order_id = o.order_id`. The result is **larger** than `SELECT COUNT(*) FROM orders`. What does that tell you?
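A bigger count after the JOIN tells you the join key repeats on the `order_items` side, so every duplicated key fans an order into multiple rows. One way to sketch the follow-up diagnostic that lists the fanning keys (sqlite3, same toy schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, total_cents INTEGER);
    CREATE TABLE order_items (order_id INTEGER, product_id INTEGER,
                              qty INTEGER, price INTEGER);
    INSERT INTO orders VALUES (42, 9900);
    INSERT INTO order_items VALUES
        (42, 101, 1, 2500), (42, 102, 2, 1200),
        (42, 103, 1, 3700), (42, 104, 1, 2500);
""")

# Keys that appear more than once in order_items will fan out the JOIN.
dupes = con.execute("""
    SELECT order_id, COUNT(*) AS n
    FROM order_items
    GROUP BY order_id
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [(42, 4)]: order 42 becomes 4 rows after the JOIN
```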
Takeaway: write the grain in a comment, predict the row count, count after, fix any gap at the model level. The whole career of bug-free analytics SQL is built on this 60-second discipline.