Codabra

Three engines, one dataset — and the wrong choice that cost a quarter

Postgres, DuckDB, ClickHouse — when each wins, when each loses, and the question that actually matters when you pick.

"We picked ClickHouse because it's fast."

A team I knew picked ClickHouse for an OLTP workload because of a benchmark blog post. Three months later they'd shipped a checkout system that took 800ms per single-row insert under load. ClickHouse is a column-store optimized for batch ingest and wide aggregations — point inserts are not its job. They migrated to Postgres in a six-week panic.

The lesson is not "ClickHouse is bad". ClickHouse is excellent for the right workload (analytics over billions of rows, where Postgres would be 100× slower). The lesson is: never pick an engine before naming the workload. Same dataset, same SQL, totally different runtime depending on the engine.
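To make the mismatch concrete, here is a minimal sketch of how ClickHouse wants to be fed. Table and column names are illustrative, not from the team's actual schema:

```sql
-- A MergeTree table: rows are sorted by the ORDER BY key and stored in
-- immutable "parts" on disk.
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    amount     Decimal(18, 2)
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Good: one INSERT carrying thousands of rows. Every INSERT creates a
-- new part on disk, so the per-insert cost is amortized across the batch.
INSERT INTO events VALUES
    ('2025-01-01 12:00:00', 1, 9.99),
    ('2025-01-01 12:00:01', 2, 19.99);
    -- ...thousands more rows per batch
```

A single-row INSERT per checkout pays that same part-creation cost for one row, which is the 800ms failure mode above. Postgres doesn't need to amortize anything: a heap insert plus a WAL record is already cheap.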

At a glance

| Engine | Sweet spot | Storage | Notes |
| --- | --- | --- | --- |
| Postgres | OLTP + light OLAP | row-store (heap) | The default. Extensions: pgvector, postgis, timescaledb. ACID, transactions, foreign keys. |
| DuckDB | Single-node OLAP, embedded | column-store | Reads Parquet/CSV directly. Runs in-process — the analyst-laptop database. No server to manage. |
| ClickHouse | Distributed OLAP | column-store, MergeTree | Throws billions of rows at aggregations. Weak at point lookups, no transactions, eventually consistent replication. |

The five questions that pick the engine for you

  1. Workload shape — point lookups (Postgres) vs wide scans (column-store)?
  2. Concurrency — 1000s of small queries (Postgres) vs dozens of huge ones (column-store)?
  3. Data volume — <1 TB (Postgres / DuckDB) vs 10s-100s of TB (ClickHouse, BigQuery, Snowflake)?
  4. Update frequency — frequent UPDATEs (Postgres) vs append-mostly (column-store)?
  5. Latency tolerance — sub-second (Postgres for OLTP, column-store for cached OLAP) vs minutes ok (Spark, Trino)?

Answer these before you read a vendor blog. Most engine debates are about (1) and (4), and most choices are wrong because nobody asked.
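Question (1) is easiest to see in SQL. The same table, two very different access patterns (table and column names are illustrative):

```sql
-- Point lookup: fetch one row by key. A row-store with a B-tree index
-- answers this in microseconds; a column-store has to stitch the row
-- back together from many column files.
SELECT * FROM orders WHERE order_id = 12345;

-- Wide scan: touch a few columns of every row. A column-store reads just
-- those columns; a row-store drags every full row through memory.
SELECT status, COUNT(*) AS n, SUM(total_cents) AS revenue
FROM orders
GROUP BY status;
```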

DuckDB on a Parquet folder — no server, no schema, just go
```sql
-- Inside duckdb (or via the CLI: `duckdb`)
-- Read every Parquet file in a folder as a table:
CREATE VIEW orders AS
SELECT * FROM read_parquet('/data/lake/orders/year=2025/month=*/*.parquet');

-- Now query it like any SQL table.
SELECT date_trunc('week', created_at) AS week,
       SUM(total_cents)               AS revenue
FROM orders
WHERE status = 'paid'
GROUP BY 1
ORDER BY 1;

-- DuckDB is column-store — this aggregates 100M rows in seconds on a laptop.
-- The Parquet files don't have to be loaded; DuckDB reads only the columns and
-- row groups it needs (predicate pushdown, projection pushdown).
```

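The pushdown is visible in the plan. Asking DuckDB to EXPLAIN the query shows which columns the Parquet scan actually reads (the exact plan format varies by DuckDB version):

```sql
EXPLAIN
SELECT date_trunc('week', created_at) AS week,
       SUM(total_cents)               AS revenue
FROM orders
WHERE status = 'paid'
GROUP BY 1;

-- The scan node lists only created_at, total_cents, and status as its
-- projections, and shows status = 'paid' as a pushed-down filter. Columns
-- the query never names are never read from disk.
```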

Takeaway: pick the engine for the workload, not the workload for the engine. Postgres handles 80% of teams' needs. DuckDB is the analyst's secret weapon for ad-hoc Parquet analysis. ClickHouse / BigQuery / Snowflake are for when aggregations over billions of rows are the daily job. Document the why in your README — your future hire will thank you.