ETL vs ELT, lakehouses and the modern data stack

Why the industry shifted from ETL to ELT, what a lakehouse actually is, and when ETL is still right.

A war story to start

A team I knew once spent six months building a beautiful Informatica pipeline that cleaned, deduplicated, joined and aggregated everything before loading into the warehouse. The CTO loved it: "Look, the warehouse is so clean!"

Then marketing asked "why did our trial-to-paid conversion drop 8% three Wednesdays ago?" The team needed the raw signup events to investigate. They had thrown them away. The transform had become the truth; the source had been forgotten.

It took four months to rebuild that pipeline ELT-style — keep the raw, transform on the way out — at a cost of about $400k in engineer-time. The CTO did not love that part.

The rest of this lesson is what they learned, in a form you can absorb in 12 minutes.

A 30-second history

In the 1990s, disk was expensive and the warehouse — Teradata, Oracle, Informix — was the most expensive box in the room. You did not load raw data into it. You transformed first, kept only what mattered, then loaded the polished result. That's ETL: Extract → Transform → Load. Tools like Informatica and DataStage did the transform on dedicated middleware servers.

Then object storage got cheap (S3 dropped to a couple of cents per GB-month) and warehouses started separating storage from compute — BigQuery, Redshift, Snowflake. You could pay for 100 cores for ten minutes, then pay nothing. The economics flipped.

Now: dump everything raw into the warehouse, transform with SQL inside it. ELT: Extract → Load → Transform.
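The whole pattern fits in a couple of statements. A minimal sketch in Postgres-flavoured SQL (the stand-in warehouse this course uses); the raw_signups table and its columns are illustrative, not from any real schema:

    -- Load: land the source payload untouched, append-only
    CREATE TABLE raw_signups (
        loaded_at  timestamptz NOT NULL DEFAULT now(),
        payload    jsonb       NOT NULL    -- exactly what the API sent
    );

    -- Transform: derive the clean shape inside the warehouse, in SQL
    CREATE VIEW signups AS
    SELECT payload->>'email'                       AS email,
           (payload->>'signed_up_at')::timestamptz AS signed_up_at,
           payload->>'plan'                        AS plan
    FROM raw_signups;

Because the transform is a view over the raw table, changing it is a one-line redeploy; the source data never has to move.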

Same arrows, different order

ETL discards the source after transforming: fast warehouse, no replay. ELT keeps the source as a raw Bronze layer and lets cheap warehouse compute do the rest (Bronze / Silver / Gold is the medallion convention, covered below).

Why ELT won (mostly)

  • Replayability. When the analyst asks "why did this metric change last Wednesday?", you can re-run the transform against last Wednesday's raw data (see the sketch after this list). ETL discards the raw, so you can't.
  • Schema-on-read. A new column appears upstream? It lands in the raw layer untouched. Your transforms continue to work.
  • One language. SQL replaces the proprietary mapping UI of legacy ETL tools. Engineers can read the code, version it in git, write tests for it.
  • Cheap. Storage is so cheap that keeping six months of raw bronze data often costs less than $50/month.
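Replayability in particular is concrete: because bronze is append-only with a load timestamp, "last Wednesday" is just a WHERE clause. A sketch against the illustrative raw_signups table from earlier (the cutoff date is made up):

    -- Same transform, time-boxed input: the metric as of last Wednesday
    SELECT payload->>'plan' AS plan,
           count(*)         AS signups
    FROM raw_signups
    WHERE loaded_at < date '2024-05-15'   -- illustrative cutoff
    GROUP BY 1;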

ETL is not dead. Three cases where you still transform before load:

  1. PII redaction — GDPR and HIPAA make landing raw personal data in a US-hosted warehouse expensive or contractually painful (transfers need adequacy decisions, SCCs, BAAs). Stripping the SSN or hashing the email at the edge sidesteps a lot of legal exposure (sketch after this list).
  2. Massive volume — when you ingest 100 TB/day from clickstreams, you can't afford to land all of it. You filter and aggregate at the edge first.
  3. Cost gates — some warehouses charge by bytes scanned. Aggregating before load can save five figures a month.
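For case 1, "at the edge" simply means the redaction runs in the extract query, before anything crosses the wire. A hedged sketch, again in Postgres SQL, assuming a hypothetical source table called users; a production pipeline would use a keyed hash (e.g. pgcrypto's hmac) rather than bare md5:

    -- Extract with redaction: the warehouse never sees raw identifiers
    SELECT id,
           md5(lower(email)) AS email_hash,  -- pseudonymous join key
           left(zip, 3)      AS zip_prefix,  -- coarsened geography
           created_at
    FROM users;
    -- The ssn column is simply never selected:
    -- the cheapest redaction there is.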

Lakehouse, in one paragraph

A data lake = cheap object storage (S3, R2, GCS) with files in it (Parquet, JSON, CSV). Low cost, but no transactions, no schema enforcement, hard to query.

A warehouse = SQL semantics, ACID transactions, fast — but historically expensive and locked-in.

A lakehouse is the marriage of the two: files on S3, plus a transactional metadata layer on top. The metadata layer is implemented by an open table format: Delta Lake (Databricks), Apache Iceberg (originally Netflix), or Apache Hudi (originally Uber).

The payoff: one copy of the data on cheap storage, queried by many engines (Spark, Trino, DuckDB, Snowflake, BigQuery) without lock-in.
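To make "one copy, many engines" concrete, here is roughly what pointing DuckDB at the lake looks like. read_parquet and the httpfs extension are real DuckDB features; the bucket path is hypothetical, and this queries bare Parquet files rather than a full Iceberg/Delta table:

    -- DuckDB reading the lake in place: no load step, no second copy
    INSTALL httpfs;
    LOAD httpfs;   -- assumes S3 credentials are already configured
    SELECT plan, count(*) AS signups
    FROM read_parquet('s3://my-lake/bronze/signups/*.parquet')
    GROUP BY plan;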

Bronze / Silver / Gold is the convention you'll see everywhere (Databricks coined it; everyone copies it):

  • Bronze — raw, append-only, faithful copy of the source. "What did the API actually send us?"
  • Silver — cleaned, deduplicated, joined, type-corrected. "What does this data really mean?"
  • Gold — business-facing facts and dimensions that BI dashboards read. "What number do I show the CFO?"

It is not a tool. It is a discipline. You can implement it in plain Postgres just fine — many teams do.
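A minimal sketch of what that looks like, continuing the illustrative raw_signups table from earlier (all names hypothetical):

    -- Bronze: raw_signups itself, append-only, never updated in place

    -- Silver: deduplicated, typed, one row per email
    CREATE VIEW silver_signups AS
    SELECT DISTINCT ON (payload->>'email')
           payload->>'email'                       AS email,
           (payload->>'signed_up_at')::timestamptz AS signed_up_at,
           payload->>'plan'                        AS plan
    FROM raw_signups
    ORDER BY payload->>'email', loaded_at DESC;    -- keep the latest copy

    -- Gold: the number the CFO sees
    CREATE VIEW gold_daily_signups AS
    SELECT signed_up_at::date AS day, plan, count(*) AS signups
    FROM silver_signups
    GROUP BY 1, 2;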

What "modern data stack" really means

A shorthand for: a managed warehouse (Snowflake / BigQuery / Redshift) + a managed loader (Fivetran / Airbyte) + dbt for in-warehouse transforms + an orchestrator (Airflow / Dagster / Prefect) + a BI tool (Looker / Mode / Metabase).

It's not magic — it's a sane default for an analytics team that doesn't want to glue together 40 components. We use this stack throughout the course: Postgres as a stand-in warehouse, dbt for transforms, Airflow for orchestration. The patterns transfer to BigQuery and Snowflake unchanged.
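In dbt, each silver or gold model is just such a SELECT in its own file, and dbt's ref() wires up the dependency graph and build order. A sketch of the gold model above as a dbt model (file path and model names are illustrative; dbt models omit the trailing semicolon):

    -- models/gold/gold_daily_signups.sql
    SELECT signed_up_at::date AS day,
           plan,
           count(*)           AS signups
    FROM {{ ref('silver_signups') }}
    GROUP BY 1, 2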

Check yourself

  • You inherit a pipeline. The analytics team complains they cannot trace why a metric changed last week. Which architectural choice most directly helps?
  • Which scenario is the strongest argument for keeping a transform-before-load (ETL-style) step?

Takeaway: "ELT is modern, ETL is dead" is a slogan, not a rule. Pick based on cost, replayability, and legal constraints. Keep your bronze layer.