Codabra

Deep SQL & Production Data Engineering

From your first SELECT to a defensible data platform

A 23-module course that takes you from the fundamentals of SQL to production-grade data platforms. Lessons are 15 minutes each, marked per role (BA, BI, DA, AE, DE, DBA, BE, MLE, PE) so the same course adapts whether you are a Business Analyst learning SQL for the first time or a Data Engineer preparing for a senior interview.

What you'll be able to do

  • Write production-quality SQL across Postgres, DuckDB, ClickHouse, BigQuery, Snowflake and Spark.
  • Read a query plan and prove your optimization with measurements.
  • Design star schemas, SCD2 dimensions and idempotent pipelines.
  • Operate dbt + Airflow with tests, contracts and lineage.
  • Defend a production-ready data platform end to end.

SQL Core and the relational model

The fundamentals every role needs: tables and keys, NULL semantics, the SELECT pipeline, and how the same logical query can run in many physical ways. By the end of this module you can read and write the SQL that makes up 80% of analytics work, and you understand why it sometimes runs slowly.

  1. Tables, keys and grain — or how I shipped a 30% revenue bug · 15 min · CORE
  2. NULL: the trap that costs every team an outage · 15 min · CORE
  3. The SELECT pipeline: order of operations · 15 min · CORE
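The NULL traps lesson 2 refers to can be reproduced in a few lines. A minimal sketch using Python's built-in sqlite3 (the `orders` table and its columns are invented for illustration; the three-valued-logic semantics are the same in Postgres, DuckDB and the other engines the course covers):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, referrer_id INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10), (2, None), (3, 20)])

# NULL is not equal to anything, including NULL itself; WHERE keeps
# only rows where the predicate is TRUE, and `= NULL` is never TRUE.
eq_null = con.execute(
    "SELECT count(*) FROM orders WHERE referrer_id = NULL").fetchone()[0]
print(eq_null)  # 0 — use IS NULL, not = NULL

# The classic outage: NOT IN against a set containing a NULL matches
# nothing, because every comparison collapses to UNKNOWN.
not_in = con.execute(
    "SELECT count(*) FROM orders "
    "WHERE id NOT IN (SELECT referrer_id FROM orders)").fetchone()[0]
print(not_in)  # 0, even though no id appears as a referrer_id
```

Rewriting the subquery as `NOT EXISTS`, or filtering the NULLs out with `WHERE referrer_id IS NOT NULL`, restores the intuitive answer.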

JOINs and join algorithms

Joining tables is where most production SQL bugs live. This module makes the cardinality of every join explicit, names the physical algorithms an optimizer picks between, and rehearses the mistakes that turn '12 rows' into '12 million rows'. Set operations (UNION/INTERSECT/EXCEPT) get their own treatment in a later module.

  1. JOIN types — and the 47 customers nobody noticed were missing · 15 min · CORE
  2. The Black Friday +800% revenue bug — and the diagnostic playbook · 15 min · CORE
  3. How JOINs run: nested loop, hash, merge — and the ANALYZE you forgot · 15 min · CORE
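The fan-out behind revenue bugs like the one in lesson 2 fits in a tiny repro. A sketch using Python's built-in sqlite3 (the `orders`/`payments` tables and the duplicated webhook write are invented; the cardinality behavior is standard SQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
con.execute("CREATE TABLE payments (order_id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (2, 101)])
# A retried webhook wrote order 1's payment twice:
con.executemany("INSERT INTO payments VALUES (?, ?)",
                [(1, 50.0), (1, 50.0), (2, 30.0)])

# The join fans out on the duplicated key, inflating revenue:
naive = con.execute(
    "SELECT sum(p.amount) FROM orders o "
    "JOIN payments p USING (order_id)").fetchone()[0]
print(naive)  # 130.0 — order 1 is counted twice

# Fix: collapse the many side to one row per join key before joining.
fixed = con.execute(
    "SELECT sum(p.amount) FROM orders o "
    "JOIN (SELECT DISTINCT order_id, amount FROM payments) p "
    "USING (order_id)").fetchone()[0]
print(fixed)  # 80.0
```

The general habit: state the expected cardinality of every join (one-to-one, one-to-many) and check the row count before and after.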

Aggregations, window functions and analytical SQL

Most analytical work is one of: GROUP BY, a window function, or a clever combination. By the end of this module you can write retention cohorts, running totals, rankings, and the gaps-and-islands patterns that appear in every job interview, all without leaving SQL.

  1. GROUP BY: COUNT, FILTER, and the dashboard that lied for a year · 15 min · CORE
  2. Window functions: ROW_NUMBER, LAG, and the running total that took 9 hours · 15 min · CORE
  3. Retention cohorts and conversion funnels — pure SQL · 15 min · CORE
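The running-total and LAG patterns from lesson 2 fit in a single query. A sketch against Python's bundled sqlite3 (window functions require SQLite 3.25+, which every supported Python ships; the `daily_sales` table is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_sales (day TEXT, revenue REAL)")
con.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                [("2024-01-01", 100.0), ("2024-01-02", 50.0),
                 ("2024-01-03", 70.0)])

# SUM(...) OVER (ORDER BY day) accumulates up to the current row;
# LAG(...) looks one row back in the same ordering.
rows = con.execute("""
    SELECT day,
           revenue,
           SUM(revenue) OVER (ORDER BY day) AS running_total,
           LAG(revenue) OVER (ORDER BY day) AS prev_day
    FROM daily_sales
    ORDER BY day
""").fetchall()
for r in rows:
    print(r)
# ('2024-01-01', 100.0, 100.0, None)
# ('2024-01-02', 50.0, 150.0, 100.0)
# ('2024-01-03', 70.0, 220.0, 50.0)
```

Unlike GROUP BY, the window version keeps every input row, which is what makes day-over-day deltas and cumulative metrics one-liners.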

Indexes and physical storage — when fewer indexes are the right answer

Pick indexes for real query patterns instead of indexing every column you can think of. Every index speeds up some reads and slows down every write; the trick is knowing which trade you're making.

  1. B-tree, GIN, BRIN and the field guide to picking one · 15 min · CORE
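The read side of the trade is easy to see in a plan. A sketch using Python's built-in sqlite3 and its `EXPLAIN QUERY PLAN` (a much smaller cousin of the Postgres `EXPLAIN` the course focuses on; the `events` table is invented, and the exact plan wording varies by SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, ts TEXT, payload TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)",
                [(i % 100, f"2024-01-{i % 28 + 1:02d}", "x")
                 for i in range(1000)])

def plan(sql):
    # Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail).
    return " | ".join(row[3]
                      for row in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT count(*) FROM events WHERE user_id = 42"
before = plan(q)
print(before)  # a full SCAN of events

# One index chosen for the real query pattern (filter on user_id):
con.execute("CREATE INDEX idx_events_user ON events (user_id)")
after = plan(q)
print(after)   # a SEARCH using idx_events_user
```

The write side of the trade does not show up here: every later INSERT into `events` now maintains the index too, which is why "fewer indexes" is sometimes the right answer.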

Query optimizer and reading plans — guessing is how regressions ship

Diagnose performance problems with EXPLAIN ANALYZE instead of guessing. Learn the three most useful flags, the four nodes that explain 80% of slow queries, and the rule that makes optimizations actually stick.

  1. EXPLAIN ANALYZE in 15 minutes — the only intro you'll ever need · 15 min · CORE

Airflow and pipeline orchestration — the backfill that didn't double the metrics

Schedule, retry and backfill data jobs with predictable behavior. Airflow is the framework most analytics teams converge on; this module covers the smallest set of patterns that keep it from biting you.

  1. Your first DAG: extract → load → dbt → publish → notify · 15 min · CORE

Spark SQL, Databricks and Delta Lake — when one node isn't enough

Distributed SQL: when one node is not enough, and how the lakehouse stays consistent. Plus the three reasons a Spark job is slow, and how to read the Spark UI to know which one bit you.

  1. Shuffle, skew, and the small-files problem · 15 min · CORE

Apache Iceberg and Trino — open table formats and federated SQL

Open table formats and federated SQL across many sources. Iceberg gives you ACID, schema evolution and time travel on Parquet files; Trino lets you JOIN Postgres to S3 to BigQuery in one query.

  1. Iceberg in one lesson — the year someone changed history · 15 min · CORE

Non-relational sources: Mongo, Elasticsearch, Vector DB

How to ingest semi-structured and search-oriented data into a relational world without losing your sanity. The lesson where JSON drift becomes a tracked event, not a 3 AM page.

  1. Ingesting semi-structured data without losing your sanity · 15 min · CORE
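The "JSON drift becomes a tracked event" idea can be sketched in plain Python. The `EXPECTED` contract, the field names, and the `ingest` helper below are all hypothetical; the point is the shape of the pattern, not a specific API:

```python
import json

EXPECTED = {"id", "email", "created_at"}  # the contract this pipeline loads

def ingest(raw_records):
    """Flatten records to the expected columns; report drift, don't crash."""
    rows, drift_events = [], []
    for raw in raw_records:
        rec = json.loads(raw)
        extra = set(rec) - EXPECTED      # fields the source started adding
        missing = EXPECTED - set(rec)    # fields the source stopped sending
        if extra or missing:
            drift_events.append({"extra": sorted(extra),
                                 "missing": sorted(missing)})
        # Load only the contracted columns; absent ones become NULLs.
        rows.append({k: rec.get(k) for k in EXPECTED})
    return rows, drift_events

rows, drift = ingest([
    '{"id": 1, "email": "a@x.com", "created_at": "2024-01-01"}',
    '{"id": 2, "email": "b@x.com", "referral": "ads"}',  # drifted record
])
print(len(rows), drift)
```

Routing `drift_events` to a metrics table or an alert channel is what turns the 3 AM page into a reviewed schema change.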

Incremental processing, CDC and late data — and the watermark that missed an update

How to keep your warehouse in sync without reloading the world every night. The watermark + overlap pattern, the SCD2-from-CDC stream, and the reconciliation that catches what watermarks miss.

  1. Watermarks, overlap windows, and the missed updates that ate a Tuesday · 15 min · CORE
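The watermark + overlap pattern is small enough to sketch end to end. A toy version using Python's built-in sqlite3 (the `source`/`target` tables, the one-day overlap, and the `incremental_sync` helper are all invented for illustration; real pipelines run this against a warehouse):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, updated_at TEXT, v TEXT)")
con.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, updated_at TEXT, v TEXT)")

OVERLAP = "-1 day"  # re-read a safety window behind the watermark

def incremental_sync(watermark):
    """Pull rows newer than (watermark - overlap) and upsert them.

    The overlap re-reads rows near the boundary, so an update that
    committed slightly out of order is still picked up, and the
    upsert keeps the re-read idempotent.
    """
    rows = con.execute(
        "SELECT id, updated_at, v FROM source "
        "WHERE updated_at > datetime(?, ?)",
        (watermark, OVERLAP)).fetchall()
    con.executemany(
        "INSERT INTO target VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET "
        "updated_at=excluded.updated_at, v=excluded.v",
        rows)
    new_wm = con.execute("SELECT max(updated_at) FROM target").fetchone()[0]
    return new_wm or watermark

con.executemany("INSERT INTO source VALUES (?, ?, ?)",
                [(1, "2024-01-01 10:00:00", "a"),
                 (2, "2024-01-02 09:00:00", "b")])
wm = incremental_sync("2024-01-01 00:00:00")

# A late update commits with a timestamp just behind the watermark:
con.execute("UPDATE source SET updated_at='2024-01-02 08:00:00', v='a2' "
            "WHERE id=1")
wm = incremental_sync(wm)
print(con.execute("SELECT v FROM target WHERE id=1").fetchone()[0])  # a2
```

Without the overlap, the second sync would filter on `> wm` and silently skip the late update, which is exactly the failure mode the reconciliation job exists to catch.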

Data observability, lineage and governance — knowing before Slack tells you

Know what is happening to your data in production before someone in Slack tells you. The three alert classes, the ownership rule that ends 'who's on call', and column-level lineage as impact-analysis-as-code.

  1. Designing alerts that fire when they should — and only when they should · 15 min · CORE

Performance engineering: from query to system — the eight levers

Connect the developer's choices to CPU, memory, IO, network and cost. The ranked list of where speed actually comes from — and the discipline that turns guesses into measurements.

  1. The eight levers of analytical performance — ranked by ROI · 15 min · CORE

Capstone: production-grade data platform

Bring everything together into one project you can defend in an interview. The 14 questions you must answer in the final review — and the rule that makes the project actually represent your skill.

  1. Capstone brief: what to build, how to defend it · 15 min · CORE