Codabra

How JOINs run: nested loop, hash, merge — and the ANALYZE you forgot

What the optimizer actually does, and why join order can change a query's runtime by 1000×.

The query that took 4 hours overnight

A pipeline that had run in 90 seconds for a year started taking 4 hours one morning. The on-call data engineer bisected: nothing in the SQL had changed, no schema change, no new indexes. The only difference was a 100M-row backfill that had loaded the previous night.

The planner had been picking a hash join on the previous data shape. After the backfill, its cardinality estimate was based on stale statistics: it still believed the right side was tiny and switched to a nested loop, scanning 100M rows once per left-side row. Hence: 4 hours.

The fix took 3 seconds: `ANALYZE order_items;`. Bulk loads do not refresh statistics on their own, and autovacuum's background ANALYZE is rarely aggressive enough to keep up with a 100M-row write.

Three physical algorithms

The INNER JOIN you write in SQL is logical: it says what you want, not how to compute it. The optimizer picks one of three physical implementations based on its size estimates:

  • Nested loop — for each row on the left, scan the right (or probe an index). Fast when one side is tiny or there's an index on the right's join key. Catastrophic when both are large and unindexed: O(left × right).
  • Hash join — build a hash table on the smaller side in memory, probe with the larger side. The workhorse of OLAP queries. O(left + right). Memory-bound.
  • Merge join — sort both sides on the join key, walk both in lockstep. Wins when inputs are already sorted (e.g. clustered indexes, ordered scans). O(left + right) plus sort cost if not already sorted.

A fourth case worth knowing: the broadcast join in distributed engines (Spark, Trino, BigQuery) — if one side is small enough, copy it to every executor and do a hash join locally. Brutal when the "small" side isn't actually small.
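The decision rule can be sketched in a few lines. Everything below is illustrative, not Spark's actual planner code; the point is that the threshold is compared against an *estimated* size, which usually reflects compressed on-disk bytes, while the broadcast copy lives decompressed in every executor's memory:

```python
def choose_join(left_bytes_est, right_bytes_est,
                broadcast_threshold=10 * 1024 * 1024):
    # Hypothetical rule, loosely modeled on Spark's
    # spark.sql.autoBroadcastJoinThreshold: broadcast whichever side
    # is ESTIMATED small enough, else fall back to a shuffle join.
    if right_bytes_est <= broadcast_threshold:
        return "broadcast-hash join (broadcast right)"
    if left_bytes_est <= broadcast_threshold:
        return "broadcast-hash join (broadcast left)"
    return "shuffle sort-merge join"

MB = 1024 * 1024
# A 90 MB lookup table squeaks under a 100 MB threshold and gets copied
# to every executor, even if it inflates to gigabytes once deserialized.
print(choose_join(10**12, 90 * MB, broadcast_threshold=100 * MB))
```

The failure mode follows directly: the estimate passes, the decompressed hash table does not fit, and executors OOM or spill.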

What the optimizer is choosing between, in pseudocode

NESTED LOOP                  HASH JOIN                  MERGE JOIN
─────────────                ─────────                  ──────────
for l in L:                  build H = {r.k: r}         sort L on key
  for r in R:                  for r in R               sort R on key
    if l.k == r.k:           for l in L:                walk both in lockstep:
      emit(l, r)               if l.k in H:               while L.k < R.k: L++
                                 emit(l, H[l.k])          while L.k > R.k: R++
                                                          if equal: emit, ++both

O(|L| × |R|)                 O(|L| + |R|)               O(|L| + |R|) + sort
unless R is indexed          memory-bound on |R|        free if pre-sorted
─────────────                ─────────────              ─────────────
small × any                  any + any (small fits      any + any with
or any + indexed key         in memory) — OLAP          clustered indexes
                             workhorse                  or already-sorted scans
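The pseudocode above translates almost directly into Python. A minimal sketch, with plain dicts and lists standing in for an engine's hash tables and sorted scans (rows are dicts, `key` is the join column name):

```python
def nested_loop_join(L, R, key):
    # O(|L| * |R|): rescan all of R for every left row
    return [(l, r) for l in L for r in R if l[key] == r[key]]

def hash_join(L, R, key):
    # O(|L| + |R|): build on R (assumed the smaller side), probe with L
    H = {}
    for r in R:
        H.setdefault(r[key], []).append(r)
    return [(l, r) for l in L for r in H.get(l[key], [])]

def merge_join(L, R, key):
    # O(|L| + |R|) after sorting: walk both sides in lockstep
    L = sorted(L, key=lambda row: row[key])
    R = sorted(R, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(L) and j < len(R):
        if L[i][key] < R[j][key]:
            i += 1
        elif L[i][key] > R[j][key]:
            j += 1
        else:
            # equal keys: emit the cross product of both duplicate groups
            k, j_start = L[i][key], j
            while i < len(L) and L[i][key] == k:
                j = j_start
                while j < len(R) and R[j][key] == k:
                    out.append((L[i], R[j]))
                    j += 1
                i += 1
    return out

left = [{"id": 1}, {"id": 2}]
right = [{"id": 2, "v": "x"}, {"id": 3, "v": "y"}]
# All three algorithms produce the same logical result.
assert nested_loop_join(left, right, "id") == hash_join(left, right, "id")
```

Same output, wildly different cost curves: swap in two large unindexed lists and only `nested_loop_join` degrades quadratically.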

The planner picks based on pg_statistic size estimates. When those estimates are wrong, the planner picks badly; that failure is exactly what the next callout covers.

Why the optimizer guesses wrong. Postgres draws estimates from pg_statistic — sample-based histograms, distinct counts, correlation. After a bulk load, those samples lag. Run ANALYZE after big writes (or set autovacuum_analyze_scale_factor lower for that table). For a deeper dive on plans, see Module 7.
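A toy model of why a frozen snapshot misleads the planner. The `TableStats` class below is an illustrative invention, not Postgres internals; it only captures the one property that matters here, that statistics are frozen at ANALYZE time:

```python
class TableStats:
    # Toy stand-in for pg_statistic: a snapshot frozen at ANALYZE time.
    def __init__(self):
        self.row_count = 0

    def analyze(self, table):
        self.row_count = len(table)   # refresh the snapshot

    def estimated_rows(self):
        return self.row_count         # stale until the next analyze()

order_items = list(range(1_000))
stats = TableStats()
stats.analyze(order_items)

# Overnight backfill: a million new rows, but no new ANALYZE.
order_items.extend(range(1_000, 1_001_000))
print(stats.estimated_rows(), "estimated vs", len(order_items), "actual")
# The planner would size its join for ~1k rows: off by 1000x.

stats.analyze(order_items)  # the 3-second fix
```

Real statistics also include histograms and distinct counts, but they go stale the same way: all at once, at write time.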

EXPLAIN (ANALYZE) — read the join algorithm and the row estimates
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name, COUNT(*)
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.status = 'paid'
GROUP BY c.name;

-- Look for:
--   "Hash Join"   -- the algorithm picked
--   "rows=N"      -- the estimate
--   "actual rows=M" -- what really happened
-- If estimate and actual differ by >10x, your stats are bad.
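That 10x check can be automated. A rough sketch that greps text-format EXPLAIN ANALYZE output for estimate-vs-actual gaps; it assumes Postgres's default text plan format, and the regex and threshold are my own, not anything built in:

```python
import re

# Matches "rows=<estimate> ... (actual ... rows=<actual>" on one plan line.
NODE = re.compile(r"rows=(\d+).*\(actual [^)]*rows=(\d+)")

def misestimated_nodes(plan_text, factor=10):
    # Return plan lines whose actual row count differs from the
    # estimate by more than `factor` in either direction.
    flagged = []
    for line in plan_text.splitlines():
        m = NODE.search(line)
        if m:
            est, act = int(m.group(1)), int(m.group(2))
            ratio = max(est, act) / max(1, min(est, act))
            if ratio > factor:
                flagged.append((line.strip(), est, act))
    return flagged

plan = """\
Nested Loop  (cost=0.43..8.45 rows=12 width=8) (actual time=0.1..950.2 rows=100000000 loops=1)
  Index Scan using idx on orders  (cost=0.43..8.44 rows=1 width=8) (actual time=0.01..0.02 rows=1 loops=1)"""
for line, est, act in misestimated_nodes(plan):
    print(f"{est} estimated vs {act} actual: {line.split('(')[0].strip()}")
```

One caveat: per-node `rows` in Postgres is reported per loop, so for nodes with `loops > 1` multiply before comparing.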

Your aggregation query has been fast for months. Last night a 100M-row backfill ran. This morning the query is 100× slower. The most likely cause?

In Spark, you set `spark.sql.autoBroadcastJoinThreshold = 100MB` and a query JOINs a 90 MB lookup table to a 1 TB facts table. What does the planner *probably* do, and what's the failure mode?

Takeaway: the algorithm choice is the planner's, but its inputs are your statistics. Run ANALYZE after big writes. Read EXPLAIN ANALYZE and compare estimated vs actual row counts — a 10× gap is your debugging starting point. The query-optimizer module takes this further.