GROUP BY: COUNT, FILTER, and the dashboard that lied for a year
Aggregation done right: COUNT(*) vs COUNT(col), the FILTER clause, conditional aggregates, and how a missing FILTER cost a team a year of bad numbers.
The dashboard that lied for a year
A SaaS team I knew built a "daily active users by plan" dashboard:
SELECT plan, COUNT(*) AS dau
FROM events
GROUP BY plan;
Looked clean. Was wrong every day for a year. The bug: events had multiple rows per user per day. COUNT(*) counted events, not users. They had been congratulating themselves on growth that was actually feature usage by the same people.
The fix is one keyword: COUNT(DISTINCT user_id). The lesson is bigger: aggregations are the easiest place to confuse what you're counting. Every aggregate has a hidden grain question: one count per what?
COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)
COUNT(*)counts rows, NULL or not.COUNT(col)counts non-NULL values ofcol. After a LEFT JOIN this is what you want — the NULL placeholder doesn't count as a match.COUNT(DISTINCT col)counts distinct non-NULL values. Much more expensive (an internal sort + dedup) — fine on small data, a problem on billions of rows.
If the dashboard story above had used COUNT(DISTINCT user_id) instead of COUNT(*), the year of bad numbers wouldn't have happened. Every aggregate is a statement about grain: one count per what, of what?
-- The old way (still works, harder to read):
SELECT customer_id,
COUNT(CASE WHEN status = 'paid' THEN 1 END) AS paid_orders,
SUM(CASE WHEN status = 'paid' THEN total_cents END) AS paid_revenue
FROM orders
GROUP BY customer_id;
-- The modern way (since SQL:2003 — Postgres has had it since 9.4):
SELECT customer_id,
COUNT(*) FILTER (WHERE status = 'paid') AS paid_orders,
COUNT(*) FILTER (WHERE status = 'refunded') AS refunded_orders,
SUM(total_cents) FILTER (WHERE status = 'paid') AS paid_revenue,
SUM(total_cents) FILTER (WHERE status = 'refunded') AS refunded_amount
FROM orders
GROUP BY customer_id;One trick worth its 60 seconds: GROUPING SETS / ROLLUP / CUBE. They let one query produce multiple GROUP BY levels at once.
ROLLUP (a, b)produces grouping by(a, b)and(a)and()(grand total). Perfect for "total per category and a grand total at the bottom" in one query.CUBE (a, b)produces every combination —(a, b),(a),(b),(). Useful for pivot-style reports.GROUPING SETS ((a, b), (a))is the manual form: list exactly the groupings you want.
The NULL in the result column tells you which level you're at. Use GROUPING(col) to disambiguate — it returns 0 if col is part of the grouping, 1 if it's the rollup placeholder.
For each country, return the number of customers and the total paid revenue in cents from those customers. Countries with **zero paid revenue** must still appear with `0` (not NULL). Output: `country`, `customers`, `paid_revenue_cents`, sorted by `paid_revenue_cents` descending then `country` ascending.
Takeaway: every aggregate has a grain. Read it as "one X per Y, where the X is COUNT/SUM/AVG/... of Z". If you can't say it out loud cleanly, the query is wrong before you run it.