AT-301g · Module 2

Alert Engineering

Alert fatigue kills observability. If the monitoring system fires 50 alerts per day, the coordinator starts ignoring them, and the 51st alert, the one that is actually critical, gets the same treatment as the 50 noise alerts before it.

The alert engineering methodology requires every alert to pass three tests before activation. The Signal Test: does this alert detect a problem that affects output quality or system stability? If not, it is a metric, not an alert. The Actionability Test: when this alert fires, is there a clear next step? If the response is "look at the dashboard and figure it out," the alert needs a more specific trigger. The Frequency Test: will this alert fire fewer than three times per week under normal operations? If it fires three or more times per week, the threshold is too sensitive.
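The three tests can be read as an activation gate. The following is a minimal sketch of that idea, not the system's actual tooling: the `AlertCandidate` class, its field names, the `failed_tests` function, and the `VAGUE_STEPS` set are all hypothetical names introduced for illustration. Only the three tests and the three-per-week threshold come from the text above.

```python
from dataclasses import dataclass

# Threshold from the Frequency Test: an alert must fire fewer than 3 times per week.
MAX_FIRES_PER_WEEK = 3

# Hypothetical examples of responses that do not count as a clear next step.
VAGUE_STEPS = {"", "look at the dashboard and figure it out"}


@dataclass
class AlertCandidate:
    """Hypothetical description of an alert awaiting activation."""
    name: str
    affects_quality_or_stability: bool  # input to the Signal Test
    next_step: str                      # input to the Actionability Test
    expected_fires_per_week: float      # input to the Frequency Test


def failed_tests(alert: AlertCandidate) -> list[str]:
    """Return the names of the tests the candidate fails (empty list = activate)."""
    failures = []
    if not alert.affects_quality_or_stability:
        failures.append("signal")         # it is a metric, not an alert
    if alert.next_step.strip().lower() in VAGUE_STEPS:
        failures.append("actionability")  # needs a more specific trigger
    if alert.expected_fires_per_week >= MAX_FIRES_PER_WEEK:
        failures.append("frequency")      # threshold is too sensitive
    return failures


# Example: a real problem, but no clear next step and far too chatty.
noisy = AlertCandidate("queue_depth_high", True, "", 12.0)
print(failed_tests(noisy))  # ['actionability', 'frequency']
```

Returning the list of failed tests, rather than a single pass/fail flag, tells the alert's author which property to fix before resubmitting it.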

Our system runs 34 active alerts across all three observability layers. Average alert volume: 2.7 per day. False positive rate: 8.41%. Every alert has a documented runbook, so when it fires, the responder knows exactly what to check, in what order, and what the resolution options are.
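As a rough sanity check, the reported figures can be set against the Frequency Test. The sketch below assumes the false positive rate is a share of fired alerts, which the text does not state. It also only computes an average: a mean below the threshold does not show that every individual alert stays under three fires per week.

```python
# Figures reported above.
ACTIVE_ALERTS = 34
FIRES_PER_DAY = 2.7
FALSE_POSITIVE_RATE = 0.0841  # assumption: fraction of fired alerts that are false

fires_per_week = FIRES_PER_DAY * 7
per_alert_per_week = fires_per_week / ACTIVE_ALERTS
false_positives_per_week = fires_per_week * FALSE_POSITIVE_RATE

print(f"Total fires per week:        {fires_per_week:.1f}")           # 18.9
print(f"Mean fires per alert / week: {per_alert_per_week:.2f}")       # 0.56
print(f"False positives per week:    {false_positives_per_week:.1f}") # 1.6
```

Under these assumptions the average alert fires roughly once every two weeks, well below the three-per-week limit, and the responder sees between one and two false alarms per week.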