Designing alerts that fire when they should — and only when they should
Alert fatigue is the easiest way to miss a real outage. Three classes, owner rule, and the rule of thumb that's saved more on-calls than coffee.
The 4,200 alerts nobody read
A team I knew had a Slack channel called #data-alerts. It got 4,200 messages a week. Most were noise — a single null in a column that allowed nulls, a query 1 second slower than the SLA, a backfill that took 11% longer.
The channel was muted within a month of every new joiner. When a real outage hit — fct_orders was empty for 3 hours during the morning load — nobody noticed. The channel had been delivering 600 alerts/day; the empty-table alert was one of them.
The fix is fewer, sharper alerts with named owners: three classes per critical table, and everything else goes on a dashboard, not a page. That system is what this post lays out.
Three alert classes per critical table — that's it.
- Freshness — the table has been updated within its SLO. `MAX(loaded_at)` more than N hours old → page. The single most useful alert in data engineering.
- Volume — today's row count is within its historical band. A MAD-based check (median ± N*MAD) over the last 30 days catches both sudden drops (pipeline broken) and sudden spikes (bot traffic, duplicate inserts).
- Schema drift — no unexpected column rename, removal, or type change. Compare today's `information_schema.columns` to yesterday's; alert on any diff.
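The MAD-based volume check from the second bullet can be sketched in a few lines. The window size and multiplier below are the ones mentioned in the text; the function name and the flat-history guard are my own choices, not a prescribed implementation.

```python
import statistics

def volume_alert(history: list[int], today: int, n_mad: float = 5.0) -> bool:
    """MAD-based volume check: alert when today's row count falls outside
    median +/- n_mad * MAD of the trailing window (e.g. last 30 days)."""
    median = statistics.median(history)
    # Median absolute deviation: robust to the occasional outlier day,
    # unlike a mean/stddev band.
    mad = statistics.median(abs(x - median) for x in history)
    # Guard against a perfectly flat history, where MAD would be 0
    # and any deviation at all would page.
    band = n_mad * mad if mad > 0 else n_mad
    return abs(today - median) > band
```

MAD is preferred over mean/stddev here because one bad day in the trailing window barely moves the band, so yesterday's incident doesn't mask today's.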
Nothing else fires alerts. Other checks (null rate, value distribution, business invariants) live on a dashboard — visible, monitored, but not paging.
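The schema-drift check amounts to a dictionary diff between two snapshots of `information_schema.columns`. The snapshot shape below (`{column_name: data_type}`) is an assumption for illustration; real snapshots would also carry ordinal position, nullability, etc.

```python
def schema_drift(yesterday: dict[str, str], today: dict[str, str]) -> list[str]:
    """Compare two {column_name: data_type} snapshots and describe
    every removal, addition, and type change."""
    changes = []
    for col in sorted(yesterday.keys() - today.keys()):
        changes.append(f"removed: {col}")
    for col in sorted(today.keys() - yesterday.keys()):
        changes.append(f"added: {col}")
    for col in sorted(yesterday.keys() & today.keys()):
        if yesterday[col] != today[col]:
            changes.append(f"type changed: {col} {yesterday[col]} -> {today[col]}")
    return changes  # non-empty list -> fire the alert

drift = schema_drift(
    {"id": "bigint", "amount": "numeric"},
    {"id": "bigint", "amount": "text", "note": "text"},
)
```

An empty list means no drift; anything else pages the owner with the exact diff in the message.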
Three classes, three populations
┌─────────┐
│ PAGE │ ◄ wakes the owner.
│ │ ~5/week per critical table.
└─────────┘ freshness, volume, schema drift.
┌──────────────┐
│ WARN │ ◄ Slack ping to the owner.
│ │ ~30/week.
└──────────────┘ p95 latency, slow queries, retries.
┌────────────────────────┐
│ INFO (dashboard) │ ◄ no notification.
│ │ hundreds of checks.
│ │ null rates, distributions,
└────────────────────────┘ business invariants.
4,200/week alerts in one Slack channel = no alerts. The pyramid above — about 5 pages, 30 warns, the rest on a dashboard — is what a healthy on-call looks like. Promote a check from INFO to WARN to PAGE only when you'd actually want to be woken up for it.
The owner rule
Every table has one named owner — a human (or a small team alias). The owner's name is in:
- The alert routing — pages go to the owner's pager, not a shared channel.
- The dbt model description — `meta: { owner: 'analytics-platform' }`.
- The runbook.
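In dbt, that `meta` tag lives in the model's properties file. A minimal sketch, assuming the `fct_orders` model from earlier; the description text is illustrative:

```yaml
# schema.yml: only meta.owner is load-bearing for alert routing
models:
  - name: fct_orders
    description: "Order facts, loaded hourly"
    meta:
      owner: 'analytics-platform'
```

The alerting layer reads `meta.owner` and routes pages accordingly, so ownership lives next to the model code instead of in a separate spreadsheet.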
Why one owner? Two people sharing on-call for the same table reliably become zero people — the bystander effect, applied to your data platform. If two teams genuinely share a table, define a primary and a secondary; the primary pages, the secondary backs up.
Implement a freshness check. The fixture's `orders` table has a `created_at`. Compute the age of the most recent row in hours (`age_hours`) and a flag `is_stale` set to true if older than 24 hours. (Note: the fixture data is from March/April 2024, so `is_stale` will be `true` against any current `now()`.) Return one row, two columns.
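One way to solve the exercise, sketched with Python's built-in `sqlite3` and a two-row stand-in for the fixture (the real fixture's schema beyond `created_at` is assumed; the dates echo the March/April 2024 range mentioned above):

```python
import sqlite3

# Tiny stand-in for the fixture's `orders` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-03-30 08:00:00"), (2, "2024-04-01 09:30:00")],
)

# One row, two columns: age of the newest row in hours, and a 24-hour
# staleness flag. julianday() differences are in days, hence the * 24.
age_hours, is_stale = conn.execute("""
    SELECT
        (julianday('now') - julianday(MAX(created_at))) * 24 AS age_hours,
        (julianday('now') - julianday(MAX(created_at))) * 24 > 24 AS is_stale
    FROM orders
""").fetchone()
```

As the note above warns, `is_stale` comes back true against any current `now()`, because the newest fixture row is from April 2024.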
Takeaway: three alert classes per critical table — freshness, volume, schema drift. Every table has one named owner. Everything else is a dashboard. The shared #data-alerts channel is the anti-pattern. Fewer, sharper, owned alerts beat 4,200 a week.