How JOINs run: nested loop, hash, merge — and the ANALYZE you forgot
What the optimizer actually does, and why join order can change a query's runtime by 1000×.
The query that took 4 hours overnight
A pipeline that had run in 90 seconds for a year started taking 4 hours one morning. The on-call DE bisected: nothing in the SQL had changed, no schema change, no new indexes. The only difference was that a 100M-row backfill had loaded the previous night.
The planner had been picking a hash join on the previous data shape. After the backfill, its cardinality estimates were based on stale statistics — it still believed the right side was tiny and switched to a nested loop, scanning 100M rows once per left-side row. Hence: 4 hours.
The fix took 3 seconds: `ANALYZE order_items;`. Bulk loads do not refresh statistics aggressively enough on their own — autoanalyze fires on change thresholds, not immediately after a load.
Three physical algorithms
The SQL INNER JOIN is logical — what you want. The optimizer picks one of three physical implementations based on size estimates:
- Nested loop — for each row on the left, scan the right (or probe an index). Fast when one side is tiny or there's an index on the right's join key. Catastrophic when both are large and unindexed: O(left × right).
- Hash join — build a hash table on the smaller side in memory, probe with the larger side. The workhorse of OLAP queries. O(left + right). Memory-bound.
- Merge join — sort both sides on the join key, walk both in lockstep. Wins when inputs are already sorted (e.g. clustered indexes, ordered scans). O(left + right) plus sort cost if not already sorted.
A fourth case worth knowing: the broadcast join in distributed engines (Spark, Trino, BigQuery) — if one side is small enough, copy it to every executor and do a hash join locally. Brutal when the "small" side isn't actually small.
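The asymptotic gap is easy to make concrete. A back-of-the-envelope sketch in Python — the 10,000-row left side and the ~4-level index depth are illustrative assumptions, not numbers from the incident:

```python
# Rough operation counts for an equi-join of n_left x n_right rows.
# Illustrative only: ignores constants, caching, and I/O.

def nested_loop_ops(n_left, n_right):
    return n_left * n_right        # compare every pair of rows

def indexed_loop_ops(n_left, index_depth=4):
    return n_left * index_depth    # one index probe (~tree depth) per left row

def hash_join_ops(n_left, n_right):
    return n_right + n_left        # build the smaller side, probe once per left row

# The incident's shape: a modest left side vs. a 100M-row right side.
# (10,000 left rows is an assumed figure for illustration.)
print(f"{nested_loop_ops(10_000, 100_000_000):,}")   # 1,000,000,000,000
print(f"{hash_join_ops(10_000, 100_000_000):,}")     # 100,010,000
```

Four orders of magnitude between the two plans on the same inputs — which is how a 90-second query becomes a 4-hour one.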
What the optimizer is choosing between, in pseudocode
NESTED LOOP
    for l in L:
        for r in R:
            if l.k == r.k: emit(l, r)
    Cost: O(|L| × |R|) unless R is indexed on the join key.
    Chosen for: small × any, or any × indexed.

HASH JOIN
    build H = {r.k: r for r in R}        # smaller side
    for l in L:
        if l.k in H: emit(l, H[l.k])
    Cost: O(|L| + |R|); memory-bound on |R|.
    Chosen for: any × any when the smaller side fits in memory — the OLAP workhorse.

MERGE JOIN
    sort L on key; sort R on key         # free if already sorted
    walk both in lockstep:
        while L.k < R.k: advance L
        while L.k > R.k: advance R
        if equal: emit, advance both
    Cost: O(|L| + |R|) + sort cost.
    Chosen for: any × any with clustered indexes or already-sorted scans.
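The pseudocode above can be made runnable. A minimal Python sketch — single-column equi-joins on lists of `(key, payload)` tuples; real executors stream pages and spill to disk, which this deliberately ignores:

```python
def nested_loop_join(L, R):
    # Compare every pair: O(|L| * |R|).
    return [(l, r) for l in L for r in R if l[0] == r[0]]

def hash_join(L, R):
    # Build on the (presumed) smaller side R, probe with L: O(|L| + |R|).
    H = {}
    for r in R:
        H.setdefault(r[0], []).append(r)   # keys may repeat
    return [(l, r) for l in L for r in H.get(l[0], [])]

def merge_join(L, R):
    # Sort both sides (free if pre-sorted), then walk in lockstep.
    L, R = sorted(L), sorted(R)
    out, i, j = [], 0, 0
    while i < len(L) and j < len(R):
        if L[i][0] < R[j][0]:
            i += 1
        elif L[i][0] > R[j][0]:
            j += 1
        else:
            # Emit this left row against the whole run of equal keys in R.
            j0 = j
            while j < len(R) and R[j][0] == L[i][0]:
                out.append((L[i], R[j]))
                j += 1
            i += 1
            # If the next left row repeats the key, rewind to the run's start.
            if i < len(L) and L[i][0] == R[j0][0]:
                j = j0
    return out
```

All three return the same row pairs; only the work done to find them differs — which is exactly the choice the planner is making.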
The planner picks based on pg_statistic size estimates. When those are wrong, the planner picks badly — that's the next callout's setup.
Why the optimizer guesses wrong. Postgres draws estimates from pg_statistic — sample-based histograms, distinct counts, correlation. After a bulk load, those samples lag. Run ANALYZE after big writes (or set autovacuum_analyze_scale_factor lower for that table). For a deeper dive on plans, see Module 7.
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name, COUNT(*)
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.status = 'paid'
GROUP BY c.name;
-- Look for:
-- "Hash Join" -- the algorithm picked
-- "rows=N" -- the estimate
-- "actual rows=M" -- what really happened
-- If estimate and actual differ by >10×, your stats are bad.

Your aggregation query has been fast for months. Last night a 100M-row backfill ran. This morning the query is 100× slower. The most likely cause?
In Spark, you set `spark.sql.autoBroadcastJoinThreshold = 100MB` and a query JOINs a 90 MB lookup table to a 1 TB facts table. What does the planner *probably* do, and what's the failure mode?
Takeaway: the algorithm choice is the planner's, but its inputs are your statistics. Run ANALYZE after big writes. Read EXPLAIN ANALYZE and compare estimated vs actual row counts — a 10× gap is your debugging starting point. The query-optimizer module takes this further.
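That estimate-vs-actual comparison can be automated. A sketch that scans text-format `EXPLAIN ANALYZE` output for nodes where the two diverge by more than 10× — the sample plan below is invented for illustration:

```python
import re

# Matches "...rows=<est> width=N) (actual time=a..b rows=<actual>..." in a
# text-format Postgres EXPLAIN ANALYZE plan node.
NODE = re.compile(r"rows=(\d+) width=\d+\) \(actual time=[\d.]+\.\.[\d.]+ rows=(\d+)")

def misestimates(plan_text, factor=10):
    """Return (line, estimated, actual) for nodes off by more than `factor`."""
    bad = []
    for line in plan_text.splitlines():
        m = NODE.search(line)
        if not m:
            continue
        est, actual = int(m.group(1)), int(m.group(2))
        ratio = max(est, actual) / max(1, min(est, actual))
        if ratio > factor:
            bad.append((line.strip(), est, actual))
    return bad

# Hypothetical sample plan: the join node's estimate is off by 90x.
plan = """\
Hash Join  (cost=10.00..20.00 rows=100 width=8) (actual time=0.10..5.00 rows=9000 loops=1)
  ->  Seq Scan on orders  (cost=0.00..15.00 rows=1000 width=8) (actual time=0.01..2.00 rows=1100 loops=1)"""

for line, est, actual in misestimates(plan):
    print(f"est={est} actual={actual}: {line}")
```

Run it over a slow query's plan and the flagged nodes tell you which tables to `ANALYZE` first.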