EXPLAIN ANALYZE in 15 minutes — the only intro you'll ever need
How to read the plan, what each node means, where the time actually goes — and the rule of thumb that catches 80% of regressions.
"It's slow" is not a diagnosis
A team I knew spent two weeks rewriting a query that felt slow. They added indexes, refactored joins, even paid for a bigger DB instance. The query was still slow. They escalated to me.
I ran EXPLAIN (ANALYZE, BUFFERS) once. The plan showed the bottleneck was a single index lookup that was reading 18 GB of buffers for a query that returned 200 rows. The index was on a low-cardinality column, so each lookup hit half the table. Fix: composite index on (status, customer_id) instead of (status). The query went from 14 seconds to 90 milliseconds.
Two weeks of guessing replaced by 30 seconds of measuring. Every senior DE has this story. Today you learn the tool that prevents your version of it.
-- Plan only (no execution, no actual rows):
EXPLAIN
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;
-- Plan + execute. The flags below are the ones that matter:
EXPLAIN (
ANALYZE, -- actually run the query and report actual times/rows
BUFFERS, -- show shared/temp/local buffer hits + reads
VERBOSE, -- include schema-qualified names and extra detail
FORMAT TEXT -- default; use FORMAT JSON for tooling
)
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;
-- WARNING: ANALYZE *runs the query*. Don't EXPLAIN ANALYZE a 6-hour DELETE.
-- Wrap in a transaction and ROLLBACK if needed:
BEGIN;
EXPLAIN (ANALYZE) DELETE FROM orders WHERE created_at < '2020-01-01';
ROLLBACK;
Reading a plan: the four-thing checklist
For every node in the output, look at four numbers:
- rows=N (estimate) vs actual rows=M. Within 10× = fine. >10× off = stale statistics or an inherently hard estimate. Run ANALYZE on the touched tables; if still off, consider CREATE STATISTICS for correlated columns (sketch after this list).
- Loops × Time per loop. A Nested Loop with a high loops= count and a non-trivial inner cost is the classic "why is this O(n²)?" slowdown.
- Buffers: shared hit=X read=Y. read is a cache miss. Many reads = your working set doesn't fit in shared_buffers and Postgres is going to disk.
- Memory spills. "Sort Method: external merge Disk: 1234kB" means the sort spilled. Bump work_mem for that session, or rewrite to need less memory.
The four numbers above explain ~80% of slow-plan diagnoses. The other 20% is interesting research.
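A sketch of the statistics fix from the first item. The orders table comes from this article's examples; the specific column pair is an assumption, so use whatever columns your WHERE clauses actually combine:
-- Refresh per-column statistics first; this alone fixes most >10× estimate gaps.
ANALYZE orders;
-- If estimates are still off because two columns move together, give the planner
-- multi-column (extended) statistics for that pair:
CREATE STATISTICS orders_status_customer (dependencies, ndistinct)
  ON status, customer_id FROM orders;
ANALYZE orders;  -- extended statistics are only populated at the next ANALYZE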
Node types you'll see most often
- Seq Scan — read every row. Fine for small tables and when most rows match. Bad when a selective index exists and isn't used.
- Index Scan — read the index, then jump to the heap for each match. Good for selective predicates (1–5% of rows).
- Index Only Scan — read just the index, no heap. Possible when all selected columns are in the index and the visibility map is clean (run VACUUM to keep this fast; a covering-index sketch follows this list).
- Bitmap Heap Scan (after a Bitmap Index Scan) — collect matching tuple IDs from the index, sort by physical location, then read the heap once in order. The middle ground when many rows match.
- Hash Join / Merge Join / Nested Loop — covered in Module 02. The plan tells you which one was picked. The diagram below shows where they sit in the tree.
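Before the diagram, a quick sketch of how the scan choice shifts with the index. The index names and the order_id column are assumptions, not part of this article's schema:
-- A plain index on a selective predicate: expect Index Scan or Bitmap Heap Scan,
-- depending on how many rows match.
CREATE INDEX idx_orders_status ON orders (status);
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id FROM orders WHERE status = 'refunded';

-- A covering index: every selected column lives in the index, so Postgres can skip
-- the heap entirely (Index Only Scan), as long as VACUUM keeps the visibility map fresh.
CREATE INDEX idx_orders_status_covering ON orders (status) INCLUDE (order_id);
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id FROM orders WHERE status = 'refunded';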
Reading a plan as a tree
For:
SELECT c.name, SUM(o.total_cents) AS revenue
FROM customers c JOIN orders o USING(customer_id)
WHERE o.status = 'paid'
GROUP BY c.name
ORDER BY revenue DESC
LIMIT 10;
The plan, drawn:
Limit  (cost=2510..2510 rows=10) (actual rows=10 loops=1)
  │
Sort  (sort key: revenue DESC)
  │
HashAggregate  (group key: c.name)   ◄── collapses to ~7 rows
  │       ▲
  │       └─ HashAggregate builds a hash table per group (memory-bound).
  │          GroupAggregate is the alternative — streams already-sorted input,
  │          cheap when the optimizer can use an index sort to feed it.
  ▼
Hash Join  (cond: o.customer_id = c.customer_id)
  ├──► Hash (build side: customers, 7 rows)
  │      └──► Seq Scan on customers c
  └──► Seq Scan on orders o  (filter: status = 'paid')
                ▲
                └─ rows estimated 8, actual 6 → close enough.
                   If estimated 1 and actual 100 000, fix stats.
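To see the GroupAggregate alternative described in the diagram, you can temporarily steer the planner away from hashing. A diagnostic sketch for one session, not a production setting:
SET enable_hashagg = off;      -- diagnostic: discourage HashAggregate for this session
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name, SUM(o.total_cents) AS revenue
FROM customers c JOIN orders o USING (customer_id)
WHERE o.status = 'paid'
GROUP BY c.name
ORDER BY revenue DESC
LIMIT 10;
RESET enable_hashagg;          -- put the planner back to normal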
The tree reads bottom-up — leaves run first, results flow up. Where to look first:
- The last node where estimated rows vs actual rows diverges by >10× — stale stats or correlated columns. ANALYZE first.
- Nodes whose Buffers: shared read = … is large — cold cache, the working set doesn't fit in shared_buffers.
- Nodes that spilled to disk (external sort … on disk … KB) — bump work_mem for that session (sketch after this list), or rewrite to need less memory.
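A before/after sketch of the spill fix. The 256MB figure is a placeholder, not a recommendation; size it to the spill the plan actually reports:
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, SUM(total_cents) FROM orders GROUP BY customer_id ORDER BY 2 DESC;
-- Before: look for "Sort Method: external merge  Disk: ...kB" (the sort spilled).

SET work_mem = '256MB';   -- session-local; does not change the server-wide default

EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, SUM(total_cents) FROM orders GROUP BY customer_id ORDER BY 2 DESC;
-- After: the same node should report "Sort Method: quicksort  Memory: ...kB".

RESET work_mem;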
pg_stat_statements — the production version of EXPLAIN
Enable the extension once (CREATE EXTENSION pg_stat_statements; + add it to shared_preload_libraries), then query the worst offenders:
SELECT query,
calls,
total_exec_time::int AS total_ms,
mean_exec_time::int AS mean_ms,
(100.0 * shared_blks_read / NULLIF(shared_blks_hit + shared_blks_read, 0))::int AS miss_pct
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
Sorts queries by total time spent, not per-call time — the right metric. A 50ms query called a million times deserves more attention than a 5-second query called once.
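The one-time setup mentioned above, as a sketch (the config change requires a server restart):
-- postgresql.conf (needs a restart to load the module):
--   shared_preload_libraries = 'pg_stat_statements'

-- Then, once per database you want to track:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Optional: wipe accumulated counters before a measurement window.
SELECT pg_stat_statements_reset();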
Write the query you'd want to `EXPLAIN ANALYZE` next: top customers by paid revenue, descending. Return `customer_id`, `paid_revenue_cents` (sum), one row per customer with at least one paid order, sorted by `paid_revenue_cents` descending then `customer_id` ascending. The query has aggregation + filter + sort — exactly the shape whose plan you'd inspect.
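One way to write it, as a sketch against the orders table used throughout this article:
SELECT customer_id,
       SUM(total_cents) AS paid_revenue_cents
FROM orders
WHERE status = 'paid'              -- the filter
GROUP BY customer_id               -- the aggregation: one row per customer with a paid order
ORDER BY paid_revenue_cents DESC,  -- the sort
         customer_id ASC;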
Your `EXPLAIN ANALYZE` shows: `Hash Join (cost=8.45..15.23 rows=10 width=20) (actual time=234.5..235.1 rows=104832 loops=1)`. The most useful next step is to run ANALYZE on the joined tables: the planner estimated 10 rows and got 104,832, an estimate gap of four orders of magnitude, which points at stale or missing statistics rather than at the join algorithm.
Takeaway: optimization without a measurement is a guess. EXPLAIN (ANALYZE, BUFFERS) is the measurement. Estimate-vs-actual gaps over 10× point at stats; high Buffers: read points at memory; high Loops on a Nested Loop points at join algorithm. Capture the plan before and after every change so the diff proves the win.
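A lightweight way to capture those before/after plans from psql (the file names are arbitrary):
\o plan_before.txt
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;
\o
-- ...apply the index / statistics / query change, then capture again and diff the files:
\o plan_after.txt
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;
\o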