Iceberg in one lesson — the year someone changed history
Snapshots, manifests, schema evolution. What changes vs Hive, and the time-travel feature you'll actually use.
"Someone changed Wednesday"
A team I knew built a finance dashboard on Hive tables. One Wednesday morning, the prior Wednesday's revenue numbers had changed. Nobody had a record of why. Hive had no transaction log; the underlying Parquet files had been overwritten by a backfill job.
The finance team wanted the original number, as it was reported on Wednesday morning, plus the current number, plus an audit trail. Hive could deliver none of those.
Apache Iceberg adds a metadata layer on top of Parquet that gives you all three: every commit produces a snapshot, and SELECT ... FOR VERSION AS OF <snapshot-id> returns the data exactly as it was at that point. The team migrated, and the auditing problem disappeared.
What Iceberg is
A table format = a metadata layer on top of Parquet/ORC files. Iceberg's metadata is a tree:
table_metadata-v0073.json
├── current-snapshot-id: 73                        ◄ HEAD
├── snapshots: [71, 72, 73]  (each with parent_id; never deleted on commit)
└─→ snap-73-manifest-list.avro
    ├─→ manifest-A → [data-1.parquet, data-2.parquet]
    ├─→ manifest-B → [data-3.parquet]
    └─→ manifest-C → [data-4.parquet]              ◄ added in snap 73

TIME TRAVEL   SELECT … FROM events FOR VERSION AS OF 71
              (read snap-71-manifest-list, find the files alive then)
ROLLBACK      ALTER TABLE events EXECUTE rollback_to_snapshot(72)
              (flips current-snapshot-id back; old snapshots stay on disk)
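In Trino, the audit workflow from the story is a couple of queries. The fct_revenue table and its total_cents column are hypothetical here, but the $snapshots metadata table and FOR VERSION AS OF are the actual mechanism:

-- List every commit and the snapshot ID you can travel to.
SELECT snapshot_id, committed_at, operation
FROM iceberg.lake."fct_revenue$snapshots"
ORDER BY committed_at DESC;

-- The number as it was reported last Wednesday (snapshot 71)...
SELECT CAST(SUM(total_cents) AS decimal(18, 2)) / 100 AS revenue_as_reported
FROM iceberg.lake.fct_revenue FOR VERSION AS OF 71;

-- ...versus the number today.
SELECT CAST(SUM(total_cents) AS decimal(18, 2)) / 100 AS revenue_now
FROM iceberg.lake.fct_revenue;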
Because each commit appends and never overwrites, last Wednesday's snapshot is still on disk — that's what fixes the audit problem from the story. Readers always see a consistent set of files (no "some files updated, some not" mid-read like Hive). Schema evolution (rename column, change type, drop column) is metadata-only — no rewrites.
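Those schema changes are each a single metadata commit. A sketch in Trino SQL, with illustrative table and column names (int-to-bigint is one of the type promotions Iceberg permits):

-- Rename: only the name-to-field-ID mapping in metadata changes.
ALTER TABLE iceberg.lake.events RENAME COLUMN user_id TO customer_id;

-- Widen a type: existing Parquet files are read with the promoted type.
ALTER TABLE iceberg.lake.events ALTER COLUMN event_count SET DATA TYPE bigint;

-- Drop: the field ID is retired; old data files are never rewritten.
ALTER TABLE iceberg.lake.events DROP COLUMN legacy_flag;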
The alternatives: Delta Lake (Databricks) and Apache Hudi (originally Uber). Same shape, different vendors. Iceberg has the broadest engine support (Spark, Trino, Flink, Snowflake, BigQuery, Athena) and the simplest catalog story.
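To make "catalog story" concrete: in Trino, each catalog named in the query below is one properties file dropped into etc/catalog/. A minimal sketch, assuming a REST catalog for Iceberg; hostnames and credentials are placeholders:

# etc/catalog/iceberg.properties (queried as iceberg.<schema>.<table>)
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://iceberg-rest:8181

# etc/catalog/postgres.properties (queried as postgres.<schema>.<table>)
connector.name=postgresql
connection-url=jdbc:postgresql://oltp-host:5432/oltp
connection-user=trino
connection-password=change-me

A bigquery.properties file works the same way; with all three mounted, one statement can join across them: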
-- Trino's killer feature: one SQL statement across many sources.
-- Each catalog points at a different engine.
SELECT c.country,
       CAST(SUM(o.total_cents) AS decimal(18, 2)) / 100 AS revenue_usd,
       COUNT(p.payment_id)                              AS attempts
FROM iceberg.lake.fct_orders o          -- Iceberg on S3
JOIN postgres.oltp.customers c          -- live Postgres
  ON c.customer_id = o.customer_id
JOIN bigquery.payments.attempts p       -- BigQuery
  ON p.order_id = o.order_id
WHERE o.created_at >= DATE '2025-03-01'
  AND o.created_at <  DATE '2025-04-01'
GROUP BY c.country;
-- The risk: Trino federates the *engines*, not the *speed*. The Postgres scan is
-- as fast as Postgres, the BigQuery scan as fast as BigQuery, and the JOIN happens
-- in Trino's memory. Without partition predicates pushed down, you can move TBs
-- across the wire.

Exercise: pretend you're querying an Iceberg snapshot from yesterday. Use a `WITH snapshot AS (...)` to filter `orders` to created_at <= '2024-04-15' (a synthetic 'as-of' date), then SUM paid revenue. Return one row, one column `historical_revenue`.
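One possible answer, sketched against the fct_orders table from the federated query above; the status column and its 'paid' value are assumptions about the schema:

-- The CTE plays the role of "the table as of 2024-04-15". With real time travel
-- you'd instead write:
--   FROM iceberg.lake.fct_orders FOR TIMESTAMP AS OF TIMESTAMP '2024-04-15 00:00:00 UTC'
WITH snapshot AS (
  SELECT *
  FROM iceberg.lake.fct_orders
  WHERE created_at <= DATE '2024-04-15'   -- synthetic "as-of" cut-off
)
SELECT CAST(SUM(total_cents) AS decimal(18, 2)) / 100 AS historical_revenue
FROM snapshot
WHERE status = 'paid';                    -- 'paid' is an assumed status value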
Takeaway: Iceberg adds ACID + time travel + cheap schema evolution to Parquet, with the broadest engine support. Trino lets you JOIN across engines; convenient, provided you verify predicate pushdown or pay in shuffled bytes.