AT-301h · Module 3

Building a Failure Pattern Library

Over time, failures repeat: not the exact same failure, but the same category of failure with different agents, different data, and different timing. A failure pattern library captures these categories so that when a new failure occurs, the debugging team can match it against known patterns and, when a match is found, skip the classification and investigation phases.

Our library currently contains 23 documented failure patterns across the six failure modes. Each entry includes: the pattern name, the failure mode category, the typical symptoms (what the failure looks like from the outside), the typical root causes (what usually caused this pattern in the past), the debugging shortcut (skip the full investigation and check this first), and the standard resolution (the fix that resolved previous instances).
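As a minimal sketch of what one library entry could look like in code, the six fields listed above map naturally onto a small record type. The class name `FailurePattern`, the `pattern_id` field, the `match_patterns` helper, and the idea of matching on exact symptom strings are all assumptions made for illustration, not a description of how the library is actually stored or searched. In the sample entry, the failure mode and debugging shortcut are placeholders, since the Thursday Cascade example does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePattern:
    """One library entry, with the six fields every entry includes."""
    pattern_id: int                 # hypothetical numeric id, e.g. 7
    name: str
    failure_mode: str               # one of the six failure mode categories
    symptoms: tuple[str, ...]       # what the failure looks like from outside
    root_causes: tuple[str, ...]    # what usually caused it in the past
    debugging_shortcut: str         # what to check first
    standard_resolution: str        # the fix that resolved previous instances


def match_patterns(observed, library, min_overlap=1):
    """Return entries sharing at least `min_overlap` symptoms, best match first.

    Assumption: symptoms are compared as case-insensitive exact strings.
    A real library might use tags or fuzzy matching instead.
    """
    observed_set = {s.lower() for s in observed}
    scored = []
    for pattern in library:
        overlap = len(observed_set & {s.lower() for s in pattern.symptoms})
        if overlap >= min_overlap:
            scored.append((overlap, pattern))
    # Sort on the overlap count only, so entries themselves are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pattern for _, pattern in scored]


THURSDAY_CASCADE = FailurePattern(
    pattern_id=7,
    name="The Thursday Cascade",
    failure_mode="(placeholder: category not stated)",
    symptoms=("quality scores drop across multiple agents",),
    root_causes=("context refresh storm after the weekly intelligence brief",),
    debugging_shortcut="(placeholder: shortcut not stated)",
    standard_resolution="stagger brief distribution across a 2-hour window",
)
```

Calling `match_patterns(["Quality scores drop across multiple agents"], [THURSDAY_CASCADE])` would return the Thursday Cascade entry, while an unrelated symptom returns an empty list and signals that a full investigation is needed.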

Example: Pattern #7 — "The Thursday Cascade." Symptom: quality scores drop across multiple agents every Thursday afternoon. Root cause: VANGUARD's weekly intelligence brief publishes Thursday morning, triggering 8 agents to simultaneously update their context with new data, creating a context refresh storm that exceeds the system's concurrent processing capacity. Resolution: stagger the intelligence brief distribution across a 2-hour window instead of instant broadcast.
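The staggered distribution in this resolution can be sketched as a simple schedule that spaces delivery times evenly across the window. The function name `stagger_schedule`, the agent names, and the 09:00 start time are hypothetical; the source specifies only that 8 agents are affected and that the window is 2 hours. Even spacing is one possible choice, and the actual rollout may order or weight agents differently.

```python
from datetime import datetime, timedelta


def stagger_schedule(agents, start, window):
    """Assign each agent a delivery time, evenly spaced within the window.

    The first agent receives the brief at `start`; each later agent is
    offset by window / len(agents), so every delivery falls before
    start + window.
    """
    if not agents:
        return {}
    step = window / len(agents)
    return {agent: start + i * step for i, agent in enumerate(agents)}


# Hypothetical agent names and start time (a Thursday morning).
agents = [f"agent-{i}" for i in range(1, 9)]
schedule = stagger_schedule(
    agents,
    start=datetime(2024, 1, 4, 9, 0),
    window=timedelta(hours=2),
)
```

With 8 agents and a 2-hour window, deliveries land 15 minutes apart, so at most one agent refreshes its context at any given moment instead of all 8 at once.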

The library saves debugging time because 71.43% of new incidents match an existing pattern within the first 5 minutes of investigation.