JOIN types — and the 47 customers nobody noticed were missing
INNER, LEFT, RIGHT, FULL, CROSS, SEMI, ANTI — what each does, how to anticipate the row count, and how INNER quietly hides bugs.
The 47 missing customers
A growth team I knew built a churn report with one query:
SELECT c.customer_id, MAX(o.created_at) AS last_order
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id;
They segmented the result, called every customer whose last_order was older than 90 days, ran a recovery campaign, and the metrics looked great.
Three months later they noticed the real churn was higher than reported. The bug: JOIN defaults to INNER, which drops every customer with zero orders — the most-churned customers of all. 47 of them. They were never on the call list.
The fix is one keyword: LEFT JOIN. The lesson is bigger: every JOIN you write makes a cardinality statement. Get the statement wrong and your data lies to you, quietly, for as long as nobody runs COUNT(*).
JOINs at a glance — what survives, in pictures
Two input tables L and R, sharing one matching key (b ↔ 2) and otherwise disjoint:
L R
┌────────┐ ┌────────┐
│ a │ 1 │ │ 2 │ X │
│ b │ 2 │ │ 3 │ Y │
│ c │ 4 │ │ 5 │ Z │
└────────┘ └────────┘
What each JOIN type keeps (▓ = matched row, █ = kept-with-NULL filler):
INNER L⋈R LEFT L⟕R RIGHT L⟖R FULL L⟗R
L.k │ R.k L.k │ R.k L.k │ R.k L.k │ R.k
────┼──── ────┼──── ────┼──── ────┼────
b │ 2 ▓ a │ █ █ │ 3 a │ █
b │ 2 ▓ █ │ 5 b │ 2 ▓
c │ █ b │ 2 ▓ c │ █
█ │ 3
█ │ 5
1 row 3 rows 3 rows 5 rows
SEMI (EXISTS) ANTI (NOT EXISTS) CROSS (every pair)
L.k L.k 3 × 3 = 9 rows
──── ──── (a,2) (a,3) (a,5)
b a (b,2) (b,3) (b,5)
c (c,2) (c,3) (c,5)
1 row 2 rows 9 rows
The trap of INNER is that it looks like the default — and it is — but it makes the strongest assumption: every left row has a match. Real-world data violates that assumption all the time. SEMI and ANTI are the under-used pair: "does this row have at least one match?" and "…or none?", with no fanout.
SEMI JOIN answers exists at least one match? without duplicating left rows. Use WHERE EXISTS (…) or IN (…).
ANTI JOIN answers has no match? — WHERE NOT EXISTS (…). Cleaner than LEFT JOIN … WHERE right.id IS NULL and harder to mess up with NULLs (because LEFT JOIN … IS NULL returns wrong results when the right column itself can be NULL).
The most common LEFT JOIN bug.
-- Looks like LEFT JOIN. Behaves like INNER JOIN.
SELECT c.*
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.status = 'paid';
Why? WHERE o.status = 'paid' runs after the LEFT JOIN inserts NULL placeholders for unmatched rows. NULL = 'paid' is NULL, which is dropped. So every customer with no orders disappears.
Fix: move the predicate into the JOIN clause: LEFT JOIN orders o ON o.customer_id = c.customer_id AND o.status = 'paid'. Now unmatched customers stay, and unwanted statuses are excluded before the LEFT side gets its NULL padding. (When you're aggregating instead of selecting, the same idea works as COUNT(*) FILTER (WHERE o.status = 'paid').)
Find every customer who has placed **zero paid orders** — the people who created an account but never converted. Return `customer_id`, `name`, ordered by `customer_id`. Use an ANTI JOIN pattern (`NOT EXISTS`) — not LEFT JOIN + IS NULL.
Find every customer who has placed at least one paid order **and** at least one refunded order. Return `customer_id`, `name`, ordered by `customer_id`. Two semi-joins is cleaner than two JOINs + DISTINCT.
Takeaway: every JOIN encodes a cardinality assumption. Default to writing LEFT JOIN when in doubt — INNER silently drops rows; LEFT keeps them visible so the bug is loud, not silent.