What a Data Engineer actually does
A working definition of the role and how it sits between backend, BI, and ML, told through a 2 AM pager story.
2 AM, somewhere on the internet
It's 2 AM. Your phone explodes. The CFO's dashboard says revenue yesterday: $0.
The website is up. Checkout works — you can place an order yourself. So why is the dashboard saying zero?
Fifteen minutes of digging later you find it: a backend engineer renamed orders.amount_cents to orders.amount_minor_units four hours ago. The ETL job tried to read the old column, found nothing, wrote nothing, and emitted no error.
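The exact mechanics vary, but one common way a rename stays silent is worth seeing up close. A minimal sketch, assuming the job extracted fields from a JSON payload in a staging table (`raw_orders` and `payload` are illustrative names, not from the story; Postgres syntax):

```sql
-- Hypothetical loader query, not the literal job from the story.
-- JSON key lookups return NULL for missing keys instead of erroring,
-- so after the rename every extracted amount silently becomes NULL:
SELECT
    order_id,
    (payload ->> 'amount_cents')::bigint AS amount_cents
FROM raw_orders;

-- Downstream, SUM over an all-NULL column returns NULL,
-- which a dashboard happily renders as $0.
-- A strict column reference would have failed loudly instead:
--   SELECT amount_cents FROM orders;
--   ERROR: column "amount_cents" does not exist
```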
This story is the job description of a Data Engineer (DE): the person who lives between the systems that create data and the people who trust it. When that boundary breaks, you're the one who gets called.
Three groups, one bridge
A DE sits between three groups who otherwise rarely talk to each other:
- Backend engineers produce data — checkout, catalog, billing, auth. They optimize for low-latency single-row writes.
- Analysts and BI consume data — dashboards, ad-hoc questions, decisions. They optimize for wide reads and fresh numbers.
- ML engineers consume and produce data: features, predictions, embeddings. They want both: wide reads for training, low-latency point access for serving.
If those three groups talked to each other every day, the DE role would not exist. Reality: the backend team renames a column on Tuesday and the BI team finds out on Friday when the dashboard goes red.
The bridge, drawn (and the broken pipe from the 2 AM story)
Solid arrows are the steady-state pipeline; the dashed column rename arrow is the 2 AM revenue-is-zero story above. The DE owns every solid arrow on this diagram, and the dashed one is the failure mode the role exists to prevent.
What you actually do, week to week
- Modeling fact and dimension tables that survive contact with reality. (data-modeling module)
- Writing idempotent pipelines that can be replayed without breaking downstream consumers; see the sketch after this list. (transactions and incremental modules)
- Reading query plans so you can prove an optimization actually helped — not just because the query "feels faster". (query-optimizer module)
- Negotiating data contracts so the column rename above doesn't become an outage. (data-quality-and-contracts module)
- Owning observability — freshness, volume, schema, quality alerts. (observability module)
- Saying no, with data, to a stakeholder who wants "a quick join" between two tables that absolutely should not be joined.
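To make the idempotency bullet concrete, here is a minimal sketch of the classic delete-then-insert pattern for one daily partition, in Postgres-style SQL. The warehouse table `fct_orders_daily` and the `created_at` and `country` columns are illustrative names, not part of the story's schema:

```sql
-- Idempotent daily load: re-running the same day replaces its rows
-- instead of appending duplicates.
BEGIN;

DELETE FROM fct_orders_daily
WHERE order_date = DATE '2024-06-01';

INSERT INTO fct_orders_daily (order_date, country, revenue_cents)
SELECT
    o.created_at::date   AS order_date,
    o.country,
    SUM(o.amount_cents)  AS revenue_cents
FROM orders AS o
WHERE o.created_at >= DATE '2024-06-01'
  AND o.created_at <  DATE '2024-06-02'
  AND o.status = 'paid'
GROUP BY 1, 2;

COMMIT;
```

The half-open date range means a replay for 2024-06-01 touches exactly the same rows every time, which is what makes the job safe to re-run after a failure.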
Old DE joke: the difference between a Data Engineer and a Data Scientist is that the DE knows the numbers are wrong before the DS does, but lies awake at night about it.
OLTP vs OLAP — the most useful axis
The single distinction that explains 80% of architecture choices in this course.
OLTP — Online Transaction Processing. Short, point-style operations. Read or update one row at a time. Examples: checkout, inventory decrement, balance update.
- Optimized for writes.
- Schema is normalized — no duplicated data.
- Storage is row-oriented — a row's columns are stored next to each other.
- Concrete systems: Postgres, MySQL, SQL Server.
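Here is the OLTP shape in miniature (illustrative table and column names): each statement touches one row by key, and the whole unit is a short transaction.

```sql
-- OLTP: short, point-style writes, one row per statement.
BEGIN;

UPDATE inventory
SET quantity = quantity - 1
WHERE sku = 'SKU-123';

INSERT INTO orders (order_id, customer_id, amount_cents, status)
VALUES (98765, 42, 1999, 'paid');

COMMIT;
```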
OLAP — Online Analytical Processing. Wide scans and aggregations. Read millions of rows, write rarely. Examples: revenue by country and month, churn cohorts.
- Optimized for reads.
- Schema is often denormalized — duplication is fine if it speeds up scans.
- Storage is column-oriented — a column's values are stored together, perfect for aggregations.
- Concrete systems: BigQuery, Snowflake, ClickHouse, DuckDB.
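And the OLAP shape for contrast (same illustrative schema): one query scans millions of rows but touches only three columns, which is exactly the access pattern column-oriented storage is built for.

```sql
-- OLAP: scan many rows, aggregate a few columns.
-- A column store reads only country, status, and amount_cents,
-- never the rest of each order row.
SELECT
    country,
    SUM(amount_cents) AS revenue_cents
FROM orders
WHERE status = 'paid'
GROUP BY country
ORDER BY revenue_cents DESC;
```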
Why two systems? Picture a retailer whose "top sellers this week" report runs every Friday afternoon, exactly when checkout traffic peaks for the weekend. The analytics scan competes with order writes for the same buffer cache, the same locks, the same I/O. Latency on POST /checkout creeps up. Sales drop a percent or two. Nobody notices on any single Friday, but this is the contention pattern that pushed the industry to keep OLTP and OLAP in separate stores.
Quick check: which workload is the better match for a column-store OLAP system: decrementing one SKU's inventory, or summing revenue by country and month across millions of rows?
Back to our 2 AM revenue-is-zero story. Which mechanism would have *most directly* prevented the outage?
Make the OLAP point physical. From the seeded `orders` table (one row per order, status in {paid, refunded, cancelled}), return revenue from **paid** orders grouped by month. Output columns: `month` (a `date` truncated to the first of the month) and `revenue_cents`, ordered by `month`. This is exactly the kind of query a column store shines on, and exactly the kind that would bother an OLTP store at peak.
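Once you've attempted it yourself, here is one way the query could look. A minimal sketch, assuming the order timestamp lives in a `created_at` column and amounts in `amount_cents` (both are assumptions about the seeded schema); the `date_trunc` form below works in both Postgres and DuckDB:

```sql
-- One possible answer: monthly revenue from paid orders.
-- created_at and amount_cents are assumed column names for the seeded table.
SELECT
    date_trunc('month', created_at)::date AS month,  -- first day of the month
    SUM(amount_cents)                     AS revenue_cents
FROM orders
WHERE status = 'paid'
GROUP BY 1
ORDER BY 1;
```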
Takeaway: DE is the role that owns the boundary between data producers and data consumers. The whole rest of this course is tools and patterns for owning that boundary well.