OC-301g · Module 3

Alert Engineering

3 min read

Alert fatigue kills monitoring. When the on-call engineer receives 50 alerts per day, they stop reading them. When they stop reading them, the one alert that matters — the one that precedes a production incident — is buried in noise. Alert engineering is the discipline of making every alert actionable: if the alert fires, the recipient must be able to take a specific action in response.

The alert design checklist asks four questions of every alert:

  • What triggered it? The metric and threshold.
  • What does it mean? The operational impact in plain language.
  • What should the recipient do? A specific action or runbook link.
  • How urgent is it? A severity level with a defined SLA.

Every alert that fails any of these four criteria is either redesigned or deleted. An alert without a clear action is a notification, not an alert, and notifications belong in a dashboard, not a pager.
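The four checklist criteria can be enforced mechanically before an alert is allowed to page anyone. The following is a minimal sketch, not a prescribed implementation: the `AlertDefinition` class, the `missing_criteria` helper, the severity names, the SLA minutes, and the runbook path are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertDefinition:
    name: str
    trigger: str   # the metric and threshold
    impact: str    # operational impact in plain language
    action: str    # specific action or runbook link
    severity: str  # urgency level, tied to a response SLA


# Hypothetical severity levels with response SLAs in minutes (assumed values).
SEVERITY_SLA_MINUTES = {"critical": 15, "warning": 240}


def missing_criteria(alert: AlertDefinition) -> list[str]:
    """Return the checklist criteria the alert fails; an empty list means it passes."""
    missing = [
        field for field in ("trigger", "impact", "action")
        if not getattr(alert, field).strip()
    ]
    # Urgency only counts if it maps to a severity with a defined SLA.
    if alert.severity not in SEVERITY_SLA_MINUTES:
        missing.append("severity")
    return missing


good = AlertDefinition(
    name="queue-backlog",
    trigger="queue_depth > 1000 for 5 minutes",
    impact="Requests are delayed; users wait longer for responses",
    action="Follow runbook: runbooks/queue-backlog",  # hypothetical path
    severity="critical",
)
bare = AlertDefinition(
    name="cpu-high", trigger="cpu > 80%", impact="", action="", severity=""
)

print(missing_criteria(good))  # []
print(missing_criteria(bare))  # ['impact', 'action', 'severity']
```

An alert like `bare` would be redesigned or moved to a dashboard rather than sent to a pager.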

Do This

  • Every alert includes: trigger, impact, action, and urgency — all four are required
  • Link every critical alert to a runbook — the on-call engineer should not have to guess what to do
  • Review alert volume monthly; if volume exceeds 5 alerts per day, investigate which alerts are non-actionable
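The monthly volume review can be sketched as a short script. This is an illustration under assumptions: it treats "alerts per day" as the average over the review period, it assumes each fired alert was recorded together with whether anyone acted on it, and the `review_alert_volume` function and the sample history are hypothetical.

```python
from collections import Counter

MAX_ALERTS_PER_DAY = 5  # review threshold from the "Do This" list


def review_alert_volume(events, days_in_period):
    """Summarize one review period.

    events: list of (alert_name, was_actioned) tuples, one per fired alert.
    Returns (average alerts per day, non-actioned alerts ranked by count).
    """
    per_day = len(events) / days_in_period
    noisy = Counter(name for name, was_actioned in events if not was_actioned)
    return per_day, noisy.most_common()


# Hypothetical 30-day history: 200 ignored CPU alerts, 10 acted-on queue alerts.
history = [("cpu-high", False)] * 200 + [("queue-backlog", True)] * 10
per_day, noisy = review_alert_volume(history, days_in_period=30)

if per_day > MAX_ALERTS_PER_DAY:
    for name, count in noisy:
        print(f"{name}: fired {count} times with no action taken")
```

Ranking by non-actioned count points the review at the alerts most likely to be redesigned or deleted first.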

Avoid This

  • Alert on every metric threshold breach without context — "CPU at 81%" is not actionable without impact assessment
  • Set identical thresholds for all agents — different agents have different baseline behaviors
  • Create alerts without runbooks; an alert that leaves the responder to figure out what to do wastes incident response time
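One way to avoid identical thresholds is to derive each agent's threshold from its own recent history. The sketch below uses mean plus k standard deviations, which is a common heuristic and not a method this module prescribes; the agent names, the sample readings, and the choice of k = 3 are hypothetical.

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Threshold = mean + k * sample standard deviation of normal readings."""
    return mean(samples) + k * stdev(samples)


# Hypothetical task durations in seconds, sampled during normal operation.
baselines = {
    "summarizer-agent": [10, 12, 11, 13, 14],
    "batch-agent": [60, 62, 61, 63, 64],
}
thresholds = {
    agent: baseline_threshold(samples) for agent, samples in baselines.items()
}


def breaches(agent, reading):
    """True if the reading exceeds that agent's own baseline threshold."""
    return reading > thresholds[agent]


# The same 20-second reading is abnormal for one agent and normal for the other.
print(breaches("summarizer-agent", 20))  # True
print(breaches("batch-agent", 20))       # False
```

A breach on its own still needs the impact, action, and urgency fields before it qualifies as an alert.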