Designing alerts that fire when they should — and only when they should

Alert fatigue is the easiest way to miss a real outage. Three alert classes, one named owner per table, and the rule of thumb that's saved more on-calls than coffee.

The 4,200 alerts nobody read

A team I knew had a Slack channel called #data-alerts. It got 4,200 messages a week. Most were noise — a single null in a column that allowed nulls, a query 1 second slower than the SLA, a backfill that took 11% longer.

The channel was muted within a month of every new joiner. When a real outage hit — fct_orders was empty for 3 hours during the morning load — nobody noticed. The channel had been delivering 600 alerts/day; the empty-table alert was one of them.

The fix is fewer, sharper alerts with named owners: three classes per critical table; everything else is a dashboard, not a page. That's what this post covers.

Three alert classes per critical table — that's it.

  1. Freshness: the table has been updated within its SLO. MAX(loaded_at) more than N hours old → page. The single most useful alert in data engineering; sketch below.
  2. Volume: today's row count sits within its historical band. MAD-based check over the last 30 days: flag when today's count falls outside median ± N*MAD. Catches both sudden drops (pipeline broken) and sudden spikes (bot traffic, duplicate inserts); sketch below.
  3. Schema drift: no unexpected column rename, removal, or type change. Compare today's information_schema.columns to yesterday's and alert on any diff; sketch below.
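
Here are minimal sketches of all three checks, assuming a Postgres-style dialect. The table name `fct_orders`, the helper tables `row_counts` and `column_snapshots` (filled daily by a scheduled job), and the thresholds (a 6-hour freshness SLO, N = 5 for the volume band) are illustrative assumptions, not a prescribed implementation.

```sql
-- 1. Freshness: page when MAX(loaded_at) is older than the SLO.
-- COALESCE handles the empty-table case: MAX over zero rows is NULL,
-- and an empty fct_orders should page, not silently return NULL.
SELECT COALESCE(MAX(loaded_at) < now() - INTERVAL '6 hours', TRUE) AS is_stale
FROM fct_orders;
```

```sql
-- 2. Volume: flag today's count when it falls outside median ± N*MAD
-- over the last 30 days. Assumes a hypothetical daily log
-- row_counts(count_date, table_name, row_count).
WITH history AS (
    SELECT row_count
    FROM row_counts
    WHERE table_name = 'fct_orders'
      AND count_date >= CURRENT_DATE - 30
      AND count_date <  CURRENT_DATE
),
med AS (
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY row_count) AS median_ct
    FROM history
),
mad AS (
    SELECT percentile_cont(0.5)
             WITHIN GROUP (ORDER BY abs(h.row_count - m.median_ct)) AS mad_ct
    FROM history h CROSS JOIN med m
)
SELECT t.row_count AS today_ct,
       m.median_ct,
       -- GREATEST(..., 1) floors the MAD so a perfectly flat history
       -- doesn't turn every tiny wobble into a page.
       abs(t.row_count - m.median_ct) > 5 * GREATEST(d.mad_ct, 1) AS is_anomalous
FROM (SELECT row_count
      FROM row_counts
      WHERE table_name = 'fct_orders' AND count_date = CURRENT_DATE) t
CROSS JOIN med m
CROSS JOIN mad d;
```

```sql
-- 3. Schema drift: diff today's snapshot of information_schema.columns
-- against yesterday's. Assumes a hypothetical column_snapshots(snapshot_date,
-- table_name, column_name, data_type) filled daily from information_schema.
SELECT COALESCE(t.column_name, y.column_name) AS column_name,
       y.data_type AS yesterday_type,
       t.data_type AS today_type
FROM      (SELECT column_name, data_type FROM column_snapshots
           WHERE table_name = 'fct_orders' AND snapshot_date = CURRENT_DATE) t
FULL JOIN (SELECT column_name, data_type FROM column_snapshots
           WHERE table_name = 'fct_orders' AND snapshot_date = CURRENT_DATE - 1) y
       ON t.column_name = y.column_name
WHERE t.column_name IS NULL              -- column removed (or renamed away)
   OR y.column_name IS NULL              -- column added (or renamed to)
   OR t.data_type <> y.data_type;        -- type changed
-- Any rows returned mean drift: page the owner.
```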

Nothing else fires alerts. Other checks (null rate, value distribution, business invariants) live on a dashboard — visible, monitored, but not paging.

Three tiers, three audiences

                   ┌─────────┐
                   │  PAGE   │  ◄ wakes the owner.
                   │         │     ~5/week per critical table.
                   └─────────┘     freshness, volume, schema drift.

                ┌──────────────┐
                │     WARN     │  ◄ Slack ping to the owner.
                │              │     ~30/week.
                └──────────────┘     p95 latency, slow queries, retries.

          ┌────────────────────────┐
          │    INFO (dashboard)    │  ◄ no notification.
          │                        │     hundreds of checks.
          │                        │     null rates, distributions,
          └────────────────────────┘     business invariants.

4,200 alerts/week in one Slack channel = no alerts. The pyramid above (about 5 pages, 30 warns, the rest on a dashboard) is what a healthy on-call looks like. Promote a check from INFO to WARN to PAGE only when you'd actually want to be woken up for it.

The owner rule

Every table has one named owner — a human (or a small team alias). The owner's name is in:

  • The alert routing — pages go to the owner's pager, not a shared channel.
  • The dbt model metadata: `meta: { owner: 'analytics-platform' }`.
  • The runbook.

Why one owner? Two people sharing on-call for the same table reliably becomes zero people: the bystander effect, applied to your data platform. If two teams genuinely share a table, define a primary and a secondary; the primary pages, the secondary backs up.

Implement a freshness check. The fixture's `orders` table has a `created_at`. Compute the age of the most recent row in hours (`age_hours`) and a flag `is_stale` set to true if older than 24 hours. (Note: the fixture data is from March/April 2024, so `is_stale` will be `true` against any current `now()`.) Return one row, two columns.
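
One way to write it, assuming a Postgres-style dialect (`EXTRACT(EPOCH FROM ...)` and interval literals vary across warehouses):

```sql
-- Age of the newest order in hours, and a staleness flag against a 24h SLO.
SELECT
    EXTRACT(EPOCH FROM (now() - MAX(created_at))) / 3600.0  AS age_hours,
    MAX(created_at) < now() - INTERVAL '24 hours'           AS is_stale
FROM orders;
```

Against the March/April 2024 fixture data, `age_hours` lands in the thousands and `is_stale` comes back `true`, exactly as the note above predicts.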

Takeaway: three alert classes per critical table — freshness, volume, schema drift. Every table has one named owner. Everything else is a dashboard. The shared #data-alerts channel is the anti-pattern. Fewer, sharper, owned alerts beat 4,200 a week.