DS-201b · Module 3

Anomaly Detection Systems

4 min read

The most valuable dashboard alert is the one you did not know to set. Static thresholds catch problems you anticipated. Anomaly detection catches problems you did not anticipate. And the problems you did not anticipate are usually the ones that matter most.

Anomaly detection uses AI to learn the normal pattern of every metric — its daily, weekly, and seasonal rhythms — and flags when the actual value deviates from the expected pattern by a statistically significant margin. No human needs to define the threshold. The AI learns what "normal" looks like and alerts when reality diverges.

ANOMALY DETECTION ARCHITECTURE
================================

INPUT:
  Historical metric data (minimum 90 days)
  Seasonal patterns (weekly, monthly, quarterly cycles)
  Known events (holidays, launches, promotions)

MODEL:
  Expected value = baseline + seasonal adjustment +
                   trend + known-event adjustment
  Anomaly score = |actual - expected| / historical_std_dev

OUTPUT:
  Score 0-1:  Normal (no alert)
  Score 1-2:  Minor anomaly (log, no alert)
  Score 2-3:  Significant anomaly (YELLOW alert)
  Score 3+:   Critical anomaly (RED alert)

EXAMPLE:
  Metric: Daily website conversions
  Expected: 42 (Tuesday average, adjusted for Feb)
  Actual: 18
  Anomaly score: 3.4 (critical)

  AI DIAGNOSIS: "Conversion form returning 500 error
  since 2:14 AM. 24 hours of lead capture lost.
  Engineering ticket auto-created."
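The scoring model above can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the function names, the std-dev value of 7.0, and the metric figures reused from the conversions example are all illustrative assumptions.

```python
def expected_value(baseline, seasonal=0.0, trend=0.0, event=0.0):
    """Expected = baseline + seasonal adjustment + trend + known-event adjustment."""
    return baseline + seasonal + trend + event

def anomaly_score(actual, expected, historical_std_dev):
    """Score = |actual - expected| / historical standard deviation."""
    return abs(actual - expected) / historical_std_dev

def classify(score):
    """Map a score onto the alert tiers above."""
    if score < 1:
        return "normal"       # no alert
    if score < 2:
        return "minor"        # log, no alert
    if score < 3:
        return "significant"  # YELLOW alert
    return "critical"         # RED alert

# The conversions example: expected 42, actual 18,
# assuming a historical std dev of roughly 7.
score = anomaly_score(actual=18, expected=42, historical_std_dev=7.0)
print(round(score, 1), classify(score))  # 3.4 critical
```

Note that the score is just a z-score: the same deviation is judged against each metric's own historical variability, which is what lets one threshold scheme cover hundreds of very different metrics.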

The power of anomaly detection increases with the breadth of metrics monitored. A human analyst can track 10-15 metrics with manual threshold checks. AI monitors 500+ metrics simultaneously, each with learned patterns, seasonal adjustments, and correlation-aware alerting.

Correlation-aware alerting is the advanced pattern. When website traffic drops AND conversion rate drops AND email delivery rate drops simultaneously, the alert does not fire three times. It fires once with the diagnosis: "Email service provider outage affecting delivery, traffic, and conversions. Single root cause." That correlation saves the team from chasing three separate problems that are actually one problem.
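The consolidation step can be sketched as grouping simultaneous anomalies by a shared upstream dependency. Everything here is an assumption for illustration: real systems typically learn these correlations from historical data rather than from a hand-written map, and the metric and system names are invented.

```python
from collections import defaultdict

# Hypothetical dependency map: metric -> upstream system it depends on.
# A metric with no known dependency is treated as its own root cause.
DEPENDS_ON = {
    "email_delivery_rate": "email_provider",
    "website_traffic": "email_provider",   # email campaigns drive traffic
    "conversion_rate": "email_provider",   # traffic drives conversions
    "checkout_latency": "payments_api",
}

def consolidate(anomalous_metrics):
    """Group anomalies firing in the same window by shared dependency,
    emitting one alert per suspected root cause instead of one per metric."""
    groups = defaultdict(list)
    for metric in anomalous_metrics:
        groups[DEPENDS_ON.get(metric, metric)].append(metric)
    return [
        f"{cause}: single root cause suspected for "
        f"{', '.join(sorted(metrics))}"
        for cause, metrics in groups.items()
    ]

alerts = consolidate(["website_traffic", "conversion_rate",
                      "email_delivery_rate"])
print(len(alerts))  # 1 -- one alert, not three
```

The design choice that matters is the grouping key: anything that collapses correlated anomalies onto a common cause (a dependency graph, a learned correlation matrix, a shared time window) produces one actionable alert instead of three chases.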

Do This

  • Deploy anomaly detection across all tracked metrics — AI handles the volume
  • Use correlation-aware alerting to consolidate related anomalies into single root causes
  • Require 90 days of historical data before activating anomaly detection — the model needs patterns

Avoid This

  • Rely exclusively on static thresholds — they only catch problems you anticipated
  • Alert on every anomaly score above 1.0 — the resulting noise kills response speed
  • Skip the known-event adjustment — every holiday and product launch will trigger false alarms