AT-301g · Module 3
Incident Response
When an alert fires, the incident response protocol activates. The protocol has five phases, each time-boxed to prevent runaway investigations that consume more resources than the incident itself.
Phase 1 — Triage (2 minutes): confirm the alert is real (not a false positive), classify severity, and assign an owner. Phase 2 — Contain (10 minutes): prevent the issue from spreading. If Agent A is producing bad outputs, halt its downstream handoffs. If a cascade is forming, isolate the root agent. Phase 3 — Diagnose (30 minutes): identify the root cause using the observability layers, correlation data, and message chain traces. Phase 4 — Resolve (time varies): implement the fix, verify through the quality loop, and restore normal operations. Phase 5 — Document (15 minutes): record the incident for the post-mortem process.
The time-boxing is critical. A Triage phase that runs 20 minutes instead of 2 overruns its budget by more than the entire 10-minute containment window: the incident spreads while the team debates severity. Hard time limits force decisions with imperfect information, which is the correct operating mode for incident response.
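The time boxes can be written down as data so that an overrun is easy to detect. The following is a minimal sketch, not part of the protocol itself: the `Phase` class and the `overrun` and `must_decide_now` helpers are hypothetical names chosen for illustration, and only the budgets come from the phase descriptions above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Phase:
    name: str
    budget: Optional[timedelta]  # None means the phase is not time-boxed


# Budgets taken from the five phases described above.
PHASES = [
    Phase("Triage", timedelta(minutes=2)),
    Phase("Contain", timedelta(minutes=10)),
    Phase("Diagnose", timedelta(minutes=30)),
    Phase("Resolve", None),  # "time varies"
    Phase("Document", timedelta(minutes=15)),
]


def overrun(phase: Phase, elapsed: timedelta) -> timedelta:
    """Return how far past its budget a phase has run (zero if within budget)."""
    if phase.budget is None or elapsed <= phase.budget:
        return timedelta(0)
    return elapsed - phase.budget


def must_decide_now(phase: Phase, elapsed: timedelta) -> bool:
    """True once the time box is spent and the owner should commit and move on."""
    return phase.budget is not None and elapsed >= phase.budget
```

For example, `overrun(PHASES[0], timedelta(minutes=20))` returns 18 minutes, which is longer than the whole Contain budget. Treating the unbounded Resolve phase as `budget=None` is one possible modeling choice; a team could equally assign it a per-incident estimate.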
- Triage (2 min): Confirm the alert. Check for false positive indicators. Classify severity (S1-S4). Assign an owner. If severity is S1 or S2, skip to containment immediately.
- Contain (10 min): Stop the bleeding. Halt downstream handoffs from the affected agent. Reroute critical work to alternates. Prevent cascade propagation. Containment is not resolution; it is damage limitation.
- Diagnose (30 min) → Resolve (time varies) → Document (15 min): Identify the root cause through observability data. Implement the fix. Verify through the quality loop. Restore operations. Document the incident with a full timeline, root cause, and resolution for the post-mortem process.
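The containment step can be illustrated with a small graph walk. This is a sketch under stated assumptions: the handoff topology is modeled as a plain dictionary mapping each agent to the agents it hands work to, and the `contain` function and the agent names are hypothetical, not part of any specific agent framework.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical handoff graph: each agent maps to the agents it hands work to.
HandoffGraph = Dict[str, List[str]]


def contain(graph: HandoffGraph, affected: str) -> Tuple[Set[Tuple[str, str]], Set[str]]:
    """Return (handoffs to halt, downstream agents at risk) for an affected agent.

    Only the affected agent's outgoing handoffs are halted; the at-risk set
    lists every agent reachable from it, which may already hold bad inputs.
    """
    halted = {(affected, target) for target in graph.get(affected, [])}
    at_risk: Set[str] = set()
    queue = deque(graph.get(affected, []))
    while queue:
        agent = queue.popleft()
        if agent == affected or agent in at_risk:
            continue  # skip cycles back to the source and already-visited agents
        at_risk.add(agent)
        queue.extend(graph.get(agent, []))
    return halted, at_risk


# Usage: Agent A is producing bad outputs and feeds B and C, which both feed D.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
halted, at_risk = contain(graph, "A")
```

Here `halted` contains the two handoffs leaving A, and `at_risk` contains B, C, and D. Cutting only the affected agent's outgoing edges matches the idea that containment limits damage without attempting a fix; the at-risk set gives the Diagnose phase a starting list of agents to inspect.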