Codabra

Deep SQL & Production Data Engineering

From your first SELECT to a defensible data platform

A 23-module course that takes you from the fundamentals of SQL to production-grade data platforms. Lessons are 15 minutes each, marked per role (BA, BI, DA, AE, DE, DBA, BE, MLE, PE) so the same course adapts whether you are a Business Analyst learning SQL for the first time or a Data Engineer preparing for a senior interview.

What you'll be able to do

  • Write production-quality SQL across Postgres, DuckDB, ClickHouse, BigQuery, Snowflake and Spark.
  • Read a query plan and prove your optimization with measurements.
  • Design star schemas, SCD2 dimensions and idempotent pipelines.
  • Operate dbt + Airflow with tests, contracts and lineage.
  • Defend a production-ready data platform end to end.

SQL Core and the relational model

The fundamentals every role needs: tables and keys, NULL semantics, the SELECT pipeline, and how the same logical query can run in many physical ways. By the end of this module you can read and write the SQL that makes up 80% of analytics work, and you understand why it sometimes runs slowly.

  1. Tables, keys and grain — or how I shipped a 30% revenue bug · 15 min · CORE
  2. NULL: the trap that costs every team an outage · 15 min · CORE
  3. The SELECT pipeline: order of operations · 15 min · CORE
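The NULL traps lesson 2 refers to can be reproduced in a few lines. A minimal sketch using Python's built-in sqlite3 (the `orders` table and its columns are invented for illustration; the three-valued-logic semantics are the same in Postgres, DuckDB and the other engines the course covers):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, referrer_id INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10), (2, None), (3, 20)])

# NULL is not equal to anything, including NULL itself; WHERE keeps
# only rows where the predicate is TRUE, and `= NULL` is never TRUE.
eq_null = con.execute(
    "SELECT count(*) FROM orders WHERE referrer_id = NULL").fetchone()[0]
print(eq_null)  # 0 — use IS NULL, not = NULL

# The classic outage: NOT IN against a set containing a NULL matches
# nothing, because every comparison collapses to UNKNOWN.
not_in = con.execute(
    "SELECT count(*) FROM orders "
    "WHERE id NOT IN (SELECT referrer_id FROM orders)").fetchone()[0]
print(not_in)  # 0, even though no id appears as a referrer_id
```

Rewriting the subquery as `NOT EXISTS`, or filtering the NULLs out with `WHERE referrer_id IS NOT NULL`, restores the intuitive answer.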

JOINs and join algorithms

Joining tables is where most production SQL bugs live. This module makes the cardinality of every join explicit, names the physical algorithms an optimizer picks between, and rehearses the mistakes that turn '12 rows' into '12 million rows'. Set operations (UNION/INTERSECT/EXCEPT) get their own treatment in a later module.

  1. JOIN types — and the 47 customers nobody noticed were missing · 15 min · CORE
  2. The Black Friday +800% revenue bug — and the diagnostic playbook · 15 min · CORE
  3. How JOINs run: nested loop, hash, merge — and the ANALYZE you forgot · 15 min · CORE
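The fan-out behind revenue bugs like the one in lesson 2 fits in a tiny repro. A sketch using Python's built-in sqlite3 (the `orders`/`payments` tables and the duplicated webhook write are invented; the cardinality behavior is standard SQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
con.execute("CREATE TABLE payments (order_id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (2, 101)])
# A retried webhook wrote order 1's payment twice:
con.executemany("INSERT INTO payments VALUES (?, ?)",
                [(1, 50.0), (1, 50.0), (2, 30.0)])

# The join fans out on the duplicated key, inflating revenue:
naive = con.execute(
    "SELECT sum(p.amount) FROM orders o "
    "JOIN payments p USING (order_id)").fetchone()[0]
print(naive)  # 130.0 — order 1 is counted twice

# Fix: collapse the many side to one row per join key before joining.
fixed = con.execute(
    "SELECT sum(p.amount) FROM orders o "
    "JOIN (SELECT DISTINCT order_id, amount FROM payments) p "
    "USING (order_id)").fetchone()[0]
print(fixed)  # 80.0
```

The general habit: state the expected cardinality of every join (one-to-one, one-to-many) and check the row count before and after.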

Aggregations, window functions and analytical SQL

Most analytical work is one of: GROUP BY, a window function, or a clever combination. By the end of this module you can write retention cohorts, running totals, rankings, and the gaps-and-islands patterns that appear in every job interview, all without leaving SQL.

  1. GROUP BY: COUNT, FILTER, and the dashboard that lied for a year · 15 min · CORE
  2. Window functions: ROW_NUMBER, LAG, and the running total that took 9 hours · 15 min · CORE
  3. Retention cohorts and conversion funnels — pure SQL · 15 min · CORE
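The running-total and LAG patterns from lesson 2 fit in a single query. A sketch against Python's bundled sqlite3 (window functions require SQLite 3.25+, which every supported Python ships; the `daily_sales` table is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_sales (day TEXT, revenue REAL)")
con.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                [("2024-01-01", 100.0), ("2024-01-02", 50.0),
                 ("2024-01-03", 70.0)])

# SUM(...) OVER (ORDER BY day) accumulates up to the current row;
# LAG(...) looks one row back in the same ordering.
rows = con.execute("""
    SELECT day,
           revenue,
           SUM(revenue) OVER (ORDER BY day) AS running_total,
           LAG(revenue) OVER (ORDER BY day) AS prev_day
    FROM daily_sales
    ORDER BY day
""").fetchall()
for r in rows:
    print(r)
# ('2024-01-01', 100.0, 100.0, None)
# ('2024-01-02', 50.0, 150.0, 100.0)
# ('2024-01-03', 70.0, 220.0, 50.0)
```

Unlike GROUP BY, the window version keeps every input row, which is what makes day-over-day deltas and cumulative metrics one-liners.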

Indexes and physical storage — when fewer indexes are the right answer

Pick indexes for real query patterns instead of indexing every column you can think of. Every index speeds up some reads and slows down every write; the trick is knowing which trade you're making.

  1. B-tree, GIN, BRIN and the field guide to picking one · 15 min · CORE
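The read side of the trade is easy to see in a plan. A sketch using Python's built-in sqlite3 and its `EXPLAIN QUERY PLAN` (a much smaller cousin of the Postgres `EXPLAIN` the course focuses on; the `events` table is invented, and the exact plan wording varies by SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, ts TEXT, payload TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)",
                [(i % 100, f"2024-01-{i % 28 + 1:02d}", "x")
                 for i in range(1000)])

def plan(sql):
    # Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail).
    return " | ".join(row[3]
                      for row in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT count(*) FROM events WHERE user_id = 42"
before = plan(q)
print(before)  # a full SCAN of events

# One index chosen for the real query pattern (filter on user_id):
con.execute("CREATE INDEX idx_events_user ON events (user_id)")
after = plan(q)
print(after)   # a SEARCH using idx_events_user
```

The write side of the trade does not show up here: every later INSERT into `events` now maintains the index too, which is why "fewer indexes" is sometimes the right answer.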

Query optimizer and reading plans — guessing is how regressions ship

Diagnose performance problems with EXPLAIN ANALYZE instead of guessing. Learn the three most useful flags, the four nodes that explain 80% of slow queries, and the rule that makes optimizations actually stick.

  1. EXPLAIN ANALYZE in 15 minutes — the only intro you'll ever need · 15 min · CORE

Airflow and pipeline orchestration — the backfill that didn't double the metrics

Schedule, retry and backfill data jobs with predictable behavior. Airflow is the framework most analytics teams converge on; this module covers the smallest set of patterns that keep it from biting you.

  1. Your first DAG: extract → load → dbt → publish → notify · 15 min · CORE

Spark SQL, Databricks and Delta Lake — when one node isn't enough

Distributed SQL: when one node is not enough, and how the lakehouse stays consistent. Plus the three reasons a Spark job is slow, and how to read the Spark UI to know which one bit you.

  1. Shuffle, skew, and the small-files problem · 15 min · CORE

Apache Iceberg and Trino — open table formats and federated SQL

Open table formats and federated SQL across many sources. Iceberg gives you ACID, schema evolution and time travel on Parquet files; Trino lets you JOIN Postgres to S3 to BigQuery in one query.

  1. Iceberg in one lesson — the year someone changed history · 15 min · CORE

Non-relational sources: Mongo, Elasticsearch, Vector DB

How to ingest semi-structured and search-oriented data into a relational world without losing your sanity. The lesson where JSON drift becomes a tracked event, not a 3 AM page.

  1. Ingesting semi-structured data without losing your sanity · 15 min · CORE
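The "JSON drift becomes a tracked event" idea can be sketched in plain Python. The `EXPECTED` contract, the field names, and the `ingest` helper below are all hypothetical; the point is the shape of the pattern, not a specific API:

```python
import json

EXPECTED = {"id", "email", "created_at"}  # the contract this pipeline loads

def ingest(raw_records):
    """Flatten records to the expected columns; report drift, don't crash."""
    rows, drift_events = [], []
    for raw in raw_records:
        rec = json.loads(raw)
        extra = set(rec) - EXPECTED      # fields the source started adding
        missing = EXPECTED - set(rec)    # fields the source stopped sending
        if extra or missing:
            drift_events.append({"extra": sorted(extra),
                                 "missing": sorted(missing)})
        # Load only the contracted columns; absent ones become NULLs.
        rows.append({k: rec.get(k) for k in EXPECTED})
    return rows, drift_events

rows, drift = ingest([
    '{"id": 1, "email": "a@x.com", "created_at": "2024-01-01"}',
    '{"id": 2, "email": "b@x.com", "referral": "ads"}',  # drifted record
])
print(len(rows), drift)
```

Routing `drift_events` to a metrics table or an alert channel is what turns the 3 AM page into a reviewed schema change.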

Incremental processing, CDC and late data — and the watermark that missed an update

How to keep your warehouse in sync without reloading the world every night. The watermark + overlap pattern, the SCD2-from-CDC stream, and the reconciliation that catches what watermarks miss.

  1. Watermarks, overlap windows, and the missed updates that ate a Tuesday · 15 min · CORE
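The watermark + overlap pattern is small enough to sketch end to end. A toy version using Python's built-in sqlite3 (the `source`/`target` tables, the one-day overlap, and the `incremental_sync` helper are all invented for illustration; real pipelines run this against a warehouse):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, updated_at TEXT, v TEXT)")
con.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, updated_at TEXT, v TEXT)")

OVERLAP = "-1 day"  # re-read a safety window behind the watermark

def incremental_sync(watermark):
    """Pull rows newer than (watermark - overlap) and upsert them.

    The overlap re-reads rows near the boundary, so an update that
    committed slightly out of order is still picked up, and the
    upsert keeps the re-read idempotent.
    """
    rows = con.execute(
        "SELECT id, updated_at, v FROM source "
        "WHERE updated_at > datetime(?, ?)",
        (watermark, OVERLAP)).fetchall()
    con.executemany(
        "INSERT INTO target VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET "
        "updated_at=excluded.updated_at, v=excluded.v",
        rows)
    new_wm = con.execute("SELECT max(updated_at) FROM target").fetchone()[0]
    return new_wm or watermark

con.executemany("INSERT INTO source VALUES (?, ?, ?)",
                [(1, "2024-01-01 10:00:00", "a"),
                 (2, "2024-01-02 09:00:00", "b")])
wm = incremental_sync("2024-01-01 00:00:00")

# A late update commits with a timestamp just behind the watermark:
con.execute("UPDATE source SET updated_at='2024-01-02 08:00:00', v='a2' "
            "WHERE id=1")
wm = incremental_sync(wm)
print(con.execute("SELECT v FROM target WHERE id=1").fetchone()[0])  # a2
```

Without the overlap, the second sync would filter on `> wm` and silently skip the late update, which is exactly the failure mode the reconciliation job exists to catch.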

Data observability, lineage and governance — knowing before Slack tells you

Know what is happening to your data in production before someone in Slack tells you. The three alert classes, the ownership rule that ends 'who's on call', and column-level lineage as impact-analysis-as-code.

  1. Designing alerts that fire when they should — and only when they should · 15 min · CORE

Performance engineering: from query to system — the eight levers

Connect the developer's choices to CPU, memory, IO, network and cost. The ranked list of where speed actually comes from — and the discipline that turns guesses into measurements.

  1. The eight levers of analytical performance — ranked by ROI · 15 min · CORE

Capstone: production-grade data platform

Bring everything together into one project you can defend in an interview. The 14 questions you must answer in the final review — and the rule that makes the project actually represent your skill.

  1. Capstone brief: what to build, how to defend it · 15 min · CORE