AT-301g · Module 3

Incident Response

4 min read

When an alert fires, the incident response protocol activates. The protocol has five phases, each time-boxed to prevent runaway investigations that consume more resources than the incident itself.

Phase 1 — Triage (2 minutes): confirm the alert is real (not a false positive), classify severity, and assign an owner. Phase 2 — Contain (10 minutes): prevent the issue from spreading. If Agent A is producing bad outputs, halt its downstream handoffs. If a cascade is forming, isolate the root agent. Phase 3 — Diagnose (30 minutes): identify the root cause using the observability layers, correlation data, and message chain traces. Phase 4 — Resolve (time varies): implement the fix, verify through the quality loop, and restore normal operations. Phase 5 — Document (15 minutes): record the incident for the post-mortem process.

The time-boxing is critical. A Triage phase that runs 20 minutes instead of 2 means the containment window was wasted — the incident spreads while the team debates severity. Hard time limits force decisions with imperfect information, which is the correct operating mode for incident response.

Triage (2 min) Confirm the alert. Check for false positive indicators. Classify severity (S1-S4). Assign an owner. If severity is S1 or S2, skip to containment immediately.
Contain (10 min) Stop the bleeding. Halt downstream handoffs from the affected agent. Reroute critical work to alternates. Prevent cascade propagation. Containment is not resolution — it is damage limitation.
Diagnose → Resolve → Document Identify root cause through observability data. Implement fix. Verify through quality loop. Restore operations. Document the incident with full timeline, root cause, and resolution for the post-mortem process.