DS-301a · Module 2
Alert Design
3 min read
Alert fatigue is the silent killer of monitoring systems. An organization that receives fifty alerts a day treats all fifty as noise. An organization that receives three alerts a day treats all three as signals. The difference is not the detection system — it is the alert design. Every alert that fires without warranting action trains the recipient to ignore the next alert. After two weeks of false positives, the real alert — the one that matters — gets the same treatment as the noise. It gets ignored. The system worked perfectly. The design failed.
Effective alert design uses severity tiers. Critical alerts demand immediate action — system outages, security breaches, data pipeline failures. These go to pagers and interrupt whatever the recipient is doing. Warning alerts signal degradation that needs attention within hours — elevated error rates, approaching capacity limits, unusual metric deviations. These go to Slack channels and email. Informational alerts log noteworthy events for later review — minor anomalies, trend changes, threshold approaches. These populate a dashboard but do not interrupt anyone. The tier determines the channel. The channel determines the attention. Getting the tier wrong in either direction is equally harmful.
- Define Severity Tiers Three tiers maximum. Critical: immediate action required, pages on-call. Warning: action required within 4 hours, Slack notification. Informational: review at next business day, dashboard only. More than three tiers creates ambiguity.
- Set Routing Rules Each alert type routes to a specific person or team based on the metric domain. Revenue anomalies route to RevOps. Infrastructure alerts route to engineering. Customer health alerts route to success. Generic routing to "the team" means nobody owns the response.
- Build Escalation Paths If a critical alert is not acknowledged within 15 minutes, escalate. If a warning is not resolved within 4 hours, escalate. Escalation is not punishment — it is a safety net. Define who gets escalated to and what authority they have to resolve.