Codabra

GROUP BY: COUNT, FILTER, and the dashboard that lied for a year

Aggregation done right: COUNT(*) vs COUNT(col), the FILTER clause, conditional aggregates, and how a missing FILTER cost a team a year of bad numbers.

The dashboard that lied for a year

A SaaS team I knew built a "daily active users by plan" dashboard:

SELECT plan, COUNT(*) AS dau
FROM events
GROUP BY plan;

Looked clean. Was wrong every day for a year. The bug: events had multiple rows per user per day. COUNT(*) counted events, not users. They had been congratulating themselves on growth that was actually feature usage by the same people.

The fix is one keyword: COUNT(DISTINCT user_id). The lesson is bigger: aggregations are the easiest place to confuse what you're counting. Every aggregate has a hidden grain question: one count per what?

COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)

  • COUNT(*) counts rows, NULL or not.
  • COUNT(col) counts non-NULL values of col. After a LEFT JOIN this is what you want — the NULL placeholder doesn't count as a match.
  • COUNT(DISTINCT col) counts distinct non-NULL values. Much more expensive (an internal sort + dedup) — fine on small data, a problem on billions of rows.

If the dashboard story above had used COUNT(DISTINCT user_id) instead of COUNT(*), the year of bad numbers wouldn't have happened. Every aggregate is a statement about grain: one count per what, of what?

FILTER: cleaner than CASE WHEN inside aggregates
-- The old way (still works, harder to read):
SELECT customer_id,
       COUNT(CASE WHEN status = 'paid' THEN 1 END) AS paid_orders,
       SUM(CASE WHEN status = 'paid' THEN total_cents END) AS paid_revenue
FROM orders
GROUP BY customer_id;

-- The modern way (since SQL:2003 — Postgres has had it since 9.4):
SELECT customer_id,
       COUNT(*) FILTER (WHERE status = 'paid')           AS paid_orders,
       COUNT(*) FILTER (WHERE status = 'refunded')       AS refunded_orders,
       SUM(total_cents) FILTER (WHERE status = 'paid')   AS paid_revenue,
       SUM(total_cents) FILTER (WHERE status = 'refunded') AS refunded_amount
FROM orders
GROUP BY customer_id;

One trick worth its 60 seconds: GROUPING SETS / ROLLUP / CUBE. They let one query produce multiple GROUP BY levels at once.

  • ROLLUP (a, b) produces grouping by (a, b) and (a) and () (grand total). Perfect for "total per category and a grand total at the bottom" in one query.
  • CUBE (a, b) produces every combination — (a, b), (a), (b), (). Useful for pivot-style reports.
  • GROUPING SETS ((a, b), (a)) is the manual form: list exactly the groupings you want.

The NULL in the result column tells you which level you're at. Use GROUPING(col) to disambiguate — it returns 0 if col is part of the grouping, 1 if it's the rollup placeholder.

For each country, return the number of customers and the total paid revenue in cents from those customers. Countries with **zero paid revenue** must still appear with `0` (not NULL). Output: `country`, `customers`, `paid_revenue_cents`, sorted by `paid_revenue_cents` descending then `country` ascending.

Takeaway: every aggregate has a grain. Read it as "one X per Y, where the X is COUNT/SUM/AVG/... of Z". If you can't say it out loud cleanly, the query is wrong before you run it.