Designing alerts that fire when they should — and only when they should
Alert fatigue is the easiest way to miss a real outage. Three classes, owner rule, and the rule of thumb that's saved more on-calls than coffee.
The 4,200 alerts nobody read
A team I knew had a Slack channel called #data-alerts. It got 4,200 messages a week. Most were noise — a single null in a column that allowed nulls, a query 1 second slower than the SLA, a backfill that took 11% longer.
The channel was muted within a month of every new joiner. When a real outage hit — fct_orders was empty for 3 hours during the morning load — nobody noticed. The channel had been delivering 600 alerts/day; the empty-table alert was one of them.
The fix is fewer, sharper alerts with named owners: three classes per critical table, and everything else goes on a dashboard, not a page. That system is what this post lays out.
Three alert classes per critical table — that's it.
- Freshness — the table has been updated within its SLO. `MAX(loaded_at)` more than N hours old → page. The single most useful alert in data engineering.
- Volume — today's row count is within its historical band. A MAD-based check (median ± N*MAD) over the last 30 days catches both sudden drops (pipeline broken) and sudden spikes (bot traffic, duplicate inserts).
- Schema drift — no unexpected column rename, removal, or type change. Compare today's `information_schema.columns` to yesterday's; alert on any diff.
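The MAD-based volume check from the second bullet can be sketched in a few lines. The window size and multiplier below are the ones mentioned in the text; the function name and the flat-history guard are my own choices, not a prescribed implementation.

```python
import statistics

def volume_alert(history: list[int], today: int, n_mad: float = 5.0) -> bool:
    """MAD-based volume check: alert when today's row count falls outside
    median +/- n_mad * MAD of the trailing window (e.g. last 30 days)."""
    median = statistics.median(history)
    # Median absolute deviation: robust to the occasional outlier day,
    # unlike a mean/stddev band.
    mad = statistics.median(abs(x - median) for x in history)
    # Guard against a perfectly flat history, where MAD would be 0
    # and any deviation at all would page.
    band = n_mad * mad if mad > 0 else n_mad
    return abs(today - median) > band
```

MAD is preferred over mean/stddev here because one bad day in the trailing window barely moves the band, so yesterday's incident doesn't mask today's.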
Nothing else fires alerts. Other checks (null rate, value distribution, business invariants) live on a dashboard — visible, monitored, but not paging.
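The schema-drift check amounts to a dictionary diff between two snapshots of `information_schema.columns`. The snapshot shape below (`{column_name: data_type}`) is an assumption for illustration; real snapshots would also carry ordinal position, nullability, etc.

```python
def schema_drift(yesterday: dict[str, str], today: dict[str, str]) -> list[str]:
    """Compare two {column_name: data_type} snapshots and describe
    every removal, addition, and type change."""
    changes = []
    for col in sorted(yesterday.keys() - today.keys()):
        changes.append(f"removed: {col}")
    for col in sorted(today.keys() - yesterday.keys()):
        changes.append(f"added: {col}")
    for col in sorted(yesterday.keys() & today.keys()):
        if yesterday[col] != today[col]:
            changes.append(f"type changed: {col} {yesterday[col]} -> {today[col]}")
    return changes  # non-empty list -> fire the alert

drift = schema_drift(
    {"id": "bigint", "amount": "numeric"},
    {"id": "bigint", "amount": "text", "note": "text"},
)
```

An empty list means no drift; anything else pages the owner with the exact diff in the message.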
Three classes, three populations
┌─────────┐
│ PAGE │ ◄ wakes the owner.
│ │ ~5/week per critical table.
└─────────┘ freshness, volume, schema drift.
┌──────────────┐
│ WARN │ ◄ Slack ping to the owner.
│ │ ~30/week.
└──────────────┘ p95 latency, slow queries, retries.
┌────────────────────────┐
│ INFO (dashboard) │ ◄ no notification.
│ │ hundreds of checks.
│ │ null rates, distributions,
└────────────────────────┘ business invariants.
4,200/week alerts in one Slack channel = no alerts. The pyramid above — about 5 pages, 30 warns, the rest on a dashboard — is what a healthy on-call looks like. Promote a check from INFO to WARN to PAGE only when you'd actually want to be woken up for it.
The owner rule
Every table has one named owner — a human (or a small team alias). The owner's name is in:
- The alert routing — pages go to the owner's pager, not a shared channel.
- The dbt model description — `meta: { owner: 'analytics-platform' }`.
- The runbook.
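In dbt, that `meta` tag lives in the model's properties file. A minimal sketch, assuming the `fct_orders` model from earlier; the description text is illustrative:

```yaml
# schema.yml: only meta.owner is load-bearing for alert routing
models:
  - name: fct_orders
    description: "Order facts, loaded hourly"
    meta:
      owner: 'analytics-platform'
```

The alerting layer reads `meta.owner` and routes pages accordingly, so ownership lives next to the model code instead of in a separate spreadsheet.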
Why one owner? Two people sharing on-call for the same table reliably become zero people — the bystander effect, applied to your data platform. If two teams genuinely share a table, define a primary and a secondary; the primary pages, the secondary backs up.
Implement a freshness check. The fixture's `orders` table has a `created_at`. Compute the age of the most recent row in hours (`age_hours`) and a flag `is_stale` set to true if older than 24 hours. (Note: the fixture data is from March/April 2024, so `is_stale` will be `true` against any current `now()`.) Return one row, two columns.
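One way to solve the exercise, sketched with Python's built-in `sqlite3` and a two-row stand-in for the fixture (the real fixture's schema beyond `created_at` is assumed; the dates echo the March/April 2024 range mentioned above):

```python
import sqlite3

# Tiny stand-in for the fixture's `orders` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-03-30 08:00:00"), (2, "2024-04-01 09:30:00")],
)

# One row, two columns: age of the newest row in hours, and a 24-hour
# staleness flag. julianday() differences are in days, hence the * 24.
age_hours, is_stale = conn.execute("""
    SELECT
        (julianday('now') - julianday(MAX(created_at))) * 24 AS age_hours,
        (julianday('now') - julianday(MAX(created_at))) * 24 > 24 AS is_stale
    FROM orders
""").fetchone()
```

As the note above warns, `is_stale` comes back true against any current `now()`, because the newest fixture row is from April 2024.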
Takeaway: three alert classes per critical table — freshness, volume, schema drift. Every table has one named owner. Everything else is a dashboard. The shared #data-alerts channel is the anti-pattern. Fewer, sharper, owned alerts beat 4,200 a week.