Codabra

The eight levers of analytical performance — ranked by ROI

A ranked list of where speed actually comes from in analytical SQL, with a war story for each lever.

"Just buy a bigger box"

A team I knew was hitting 95% CPU on their warehouse. The CTO was about to triple the instance size. Before signing the PO, the lead DE ran one experiment: she added a partition filter to the top-3 most-expensive recurring queries. CPU dropped to 30%. The bigger box was no longer needed.

Hardware is the last lever, not the first. The seven that come before it are mostly free, and they're ranked by ROI below. Optimizing in this order is what separates engineers from people who just sign invoices.

Eight levers, ranked by typical ROI

  1. Reduce rows scanned         ████████████████████████████  free, ~10×+
  2. Reduce columns scanned      ████████████████████          5–100× on cols
  3. Pre-aggregate (mart layer)  ███████████████               amortizes 10000×
  4. Pick the join algorithm     ████████████                  fix the planner
  5. Cache / materialized views  █████████                     skip repeat work
  6. Approximate (HLL, t-digest) ██████                        consumer rounds
  7. Concurrency / queues        ███                           isolation > speed
  8. Hardware (bigger box)       █                             last resort, $$$

Read top-down. Lever 1 alone solves most "the warehouse is slow" tickets. Hardware is what you reach for after the seven above are exhausted, not before — the CTO from the story would have signed a triple-size PO; the DE found lever 1 in 30 minutes and saved the spend.

The eight levers, in detail

  1. Reduce rows scanned. Partition / cluster / index / push-down filters. The single highest-ROI optimization. Most slow analytical queries are scanning 10–1000× more rows than they should (see the first sketch after this list).
  2. Reduce columns scanned. Column-store + project early (SELECT only what you need). On a 100-column events table, this is the difference between 30s and 0.3s.
  3. Pre-aggregate in incremental marts; serve dashboards from there. A daily materialized aggregation queried 10,000 times/day costs 1/10,000 of computing it on-demand (see the second sketch after this list).
  4. Pick the right join algorithm: broadcast small tables, hash-join mid-size ones, merge-join inputs that are already sorted. Covered in Module 02 / 06.
  5. Avoid recomputation with materialized views, query result cache, or app-level caches.
  6. Approximate when the metric tolerates it. COUNT(DISTINCT user_id) over 1B rows is a sort-and-dedup; APPROX_COUNT_DISTINCT(user_id) (HyperLogLog) is one pass with ~2% error and ~1/100th the cost.
  7. Concurrency control — workload isolation, query queues, separate warehouses for ad-hoc vs scheduled. Stops one runaway query from starving everyone.
  8. Hardware — bigger box, more replicas, faster disk. The last resort. Always exhaust the seven above before paying.
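
A minimal sketch of lever 1, assuming a hypothetical `events` table that is partitioned (or clustered) on `event_date`; the table, columns, and date are illustrative, and exact pruning behavior depends on your engine:

```sql
-- Before: no predicate on the partition column, so every partition is read.
SELECT user_id, COUNT(*) AS sessions
FROM events
GROUP BY user_id;

-- After: the filter on event_date lets the planner prune partitions outside
-- the window before reading any data. Check the plan (EXPLAIN) to confirm
-- the "partitions scanned" number actually dropped.
SELECT user_id, COUNT(*) AS sessions
FROM events
WHERE event_date >= DATE '2025-01-01'
GROUP BY user_id;
```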
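
And a sketch of lever 3, with illustrative names (`orders`, `amount_cents`, `status`): pay for the aggregation once per day, then serve every dashboard read from the small result.

```sql
-- Built once per day by the pipeline. In production this is refreshed
-- incrementally (insert only the newest day), not rebuilt from scratch.
CREATE TABLE mart_daily_revenue AS
SELECT
    order_date,
    SUM(amount_cents) AS revenue_cents,
    COUNT(*)          AS order_count
FROM orders
WHERE status = 'paid'
GROUP BY order_date;

-- Dashboards hit the tiny mart thousands of times a day instead of
-- re-aggregating the raw orders table on every page load.
SELECT order_date, revenue_cents
FROM mart_daily_revenue
ORDER BY order_date DESC
LIMIT 30;
```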

When approximation is the right answer

  • "Count distinct users last quarter" on 1B-row events table → exact COUNT(DISTINCT) is minutes; HLL is seconds. The 2% error is invisible to a stakeholder rounding to "about 12 million".
  • "Top 100 search terms by frequency" → exact heavy-hitter; sketch (Count-Min, Theta) is good enough.
  • "P99 latency" → exact requires sorting; t-digest is one pass.

Reach for approximation when the consumer of the metric is rounding anyway. If finance demands exact, give them exact (and a longer SLA).
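
A sketch of the first bullet, using the same illustrative `events` table as above; the quarter boundary date is made up, and the approximate function's name varies by engine (`APPROX_COUNT_DISTINCT` in several warehouses, `approx_distinct` in Trino/Presto):

```sql
-- Exact: the engine must de-duplicate every user_id, which over ~1B rows
-- means heavy hashing or sorting and possibly spilling to disk.
SELECT COUNT(DISTINCT user_id) AS users_last_quarter
FROM events
WHERE event_date >= DATE '2025-07-01';

-- Approximate: one pass that builds a HyperLogLog sketch, typically ~2% error.
SELECT APPROX_COUNT_DISTINCT(user_id) AS users_last_quarter
FROM events
WHERE event_date >= DATE '2025-07-01';
```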

Apply lever #1 (reduce rows scanned) and lever #2 (reduce columns scanned) to the same query. Compute the total paid revenue. Return one row, one column `revenue_cents`. The wrong way: `SELECT * FROM orders` then aggregate in the app. The right way: SELECT only the column you need, with a server-side filter.
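
One way that looks in SQL, reusing the illustrative `orders` columns from the mart sketch above (`amount_cents` for the value, `status = 'paid'` standing in for "paid"):

```sql
-- Wrong: every column of every row leaves the warehouse, and the
-- aggregation happens in application code.
-- SELECT * FROM orders;

-- Right: project one column (lever 2) and filter server-side (lever 1),
-- so the engine touches only what the answer needs.
SELECT SUM(amount_cents) AS revenue_cents
FROM orders
WHERE status = 'paid';
```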

Takeaway: rows scanned > columns scanned > pre-aggregation > join algorithm > caching > approximation > concurrency > hardware. Work the list in order. Optimization without a measurement is a guess — capture the plan before, change one thing, capture the plan after.