← Deep SQL & Production Data Engineering

GROUP BY: COUNT, FILTER, and the dashboard that lied for a year

Aggregation done right: COUNT(*) vs COUNT(col), the FILTER clause, conditional aggregates, and how a missing FILTER cost a team a year of bad numbers.

The dashboard that lied for a year

A SaaS team I knew built a "daily active users by plan" dashboard:

SELECT plan, COUNT(*) AS dau
FROM events
GROUP BY plan;

Looked clean. Was wrong every day for a year. The bug: events had multiple rows per user per day. COUNT(*) counted events, not users. They had been congratulating themselves on growth that was actually feature usage by the same people.

The fix is one keyword: COUNT(DISTINCT user_id). The lesson is bigger: aggregations are the easiest place to confuse what you're counting. Every aggregate has a hidden grain question: one count per what?

COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)

COUNT(*) counts rows, NULL or not.
COUNT(col) counts non-NULL values of col. After a LEFT JOIN this is what you want — the NULL placeholder doesn't count as a match.
COUNT(DISTINCT col) counts distinct non-NULL values. Much more expensive (an internal sort + dedup) — fine on small data, a problem on billions of rows.

If the dashboard story above had used COUNT(DISTINCT user_id) instead of COUNT(*), the year of bad numbers wouldn't have happened. Every aggregate is a statement about grain: one count per what, of what?

FILTER: cleaner than CASE WHEN inside aggregates

-- The old way (still works, harder to read):
SELECT customer_id,
       COUNT(CASE WHEN status = 'paid' THEN 1 END) AS paid_orders,
       SUM(CASE WHEN status = 'paid' THEN total_cents END) AS paid_revenue
FROM orders
GROUP BY customer_id;

-- The modern way (since SQL:2003 — Postgres has had it since 9.4):
SELECT customer_id,
       COUNT(*) FILTER (WHERE status = 'paid')           AS paid_orders,
       COUNT(*) FILTER (WHERE status = 'refunded')       AS refunded_orders,
       SUM(total_cents) FILTER (WHERE status = 'paid')   AS paid_revenue,
       SUM(total_cents) FILTER (WHERE status = 'refunded') AS refunded_amount
FROM orders
GROUP BY customer_id;

One trick worth its 60 seconds: GROUPING SETS / ROLLUP / CUBE. They let one query produce multiple GROUP BY levels at once.

ROLLUP (a, b) produces grouping by (a, b) and (a) and () (grand total). Perfect for "total per category and a grand total at the bottom" in one query.
CUBE (a, b) produces every combination — (a, b), (a), (b), (). Useful for pivot-style reports.
GROUPING SETS ((a, b), (a)) is the manual form: list exactly the groupings you want.

The NULL in the result column tells you which level you're at. Use GROUPING(col) to disambiguate — it returns 0 if col is part of the grouping, 1 if it's the rollup placeholder.

For each country, return the number of customers and the total paid revenue in cents from those customers. Countries with **zero paid revenue** must still appear with `0` (not NULL). Output: `country`, `customers`, `paid_revenue_cents`, sorted by `paid_revenue_cents` descending then `country` ascending.

Takeaway: every aggregate has a grain. Read it as "one X per Y, where the X is COUNT/SUM/AVG/... of Z". If you can't say it out loud cleanly, the query is wrong before you run it.