RC-401b · Module 3
Incident Response
Agent incidents are not software incidents. When traditional software fails, it usually fails visibly: it crashes, throws an error, or stops responding. When an agent fails, it might keep working, confidently producing incorrect output, executing unauthorized actions, or leaking data through side channels it was never supposed to access. The failure mode is not a crash. It is silent drift.
Incident response for agent operations pulls from three tracks. AS provides the security incident framework: detection, containment, eradication, recovery, lessons learned. OC provides the infrastructure incident procedures: hardware failure, network partition, API provider outage, credential rotation under duress. AT provides conflict resolution: what happens when agents in a team produce contradictory outputs, when a critic agent deadlocks with a specialist, or when the lead's synthesis diverges from the evidence.
- Detection: Recognize the Anomaly. Your monitoring stack (Lesson 6) generates alerts when metrics deviate from baseline. But agent incidents have a unique detection challenge: the agent may be producing plausible-looking output that is subtly wrong. Train your monitoring to check output quality, not just output existence. If an agent that normally takes 45 seconds starts completing in 3 seconds, that is not a performance improvement; it is a sign the agent is skipping steps.
- Containment: Stop the Blast Radius. The first action in any agent incident is containment, not diagnosis. Revoke the agent's external API access. Pause all pending actions in the queue. If the agent is part of a team, isolate it from the lead to prevent contaminated output from flowing to other specialists. Containment takes 30 seconds. Diagnosis can take hours. Do not let the agent continue operating while you figure out what went wrong.
- Resolution: Fix, Verify, Restore. After containment, diagnose the root cause from logs and action history. Apply the fix: prompt correction, configuration change, credential rotation, or full rollback. Run the fixed agent through your complete test pipeline (all three validation layers). Only after all tests pass do you restore external access. Document the incident: what triggered it, how it was detected, time-to-containment, root cause, and the specific prevention measure added to stop recurrence.
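The three steps above can be sketched in code. This is a minimal illustration, not a reference implementation. The `Agent` class, its flags, the z-score threshold, and the `IncidentRecord` fields are all hypothetical names chosen for this example. A real deployment would revoke credentials and pause queues through whatever control plane it actually uses.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Agent:
    """Hypothetical stand-in for a deployed agent's access state."""
    name: str
    api_access: bool = True
    queue_paused: bool = False
    isolated_from_lead: bool = False


@dataclass
class IncidentRecord:
    """Fields the lesson asks you to document after resolution."""
    trigger: str
    detected_by: str
    time_to_containment_s: float
    root_cause: str
    prevention_measure: str


def is_anomalous(duration_s: float, baseline_s: list[float],
                 z_limit: float = 3.0) -> bool:
    """Flag a run whose duration deviates from baseline in either direction.

    A run that is much faster than usual is flagged too, because it may
    mean the agent skipped steps. Needs at least two baseline samples.
    The z_limit default is an assumption, not a recommended value.
    """
    mu, sigma = mean(baseline_s), stdev(baseline_s)
    if sigma == 0:
        return duration_s != mu
    return abs(duration_s - mu) / sigma > z_limit


def contain(agent: Agent) -> None:
    """Containment first: cut access, pause work, isolate. No diagnosis here."""
    agent.api_access = False
    agent.queue_paused = True
    agent.isolated_from_lead = True


def restore(agent: Agent, test_results: dict[str, bool]) -> bool:
    """Restore access only if every validation layer reported a pass."""
    if not test_results or not all(test_results.values()):
        return False
    agent.api_access = True
    agent.queue_paused = False
    agent.isolated_from_lead = False
    return True


if __name__ == "__main__":
    baseline = [44.0, 46.0, 45.0, 47.0, 43.0]  # illustrative durations in seconds
    agent = Agent("specialist-1")
    if is_anomalous(3.0, baseline):
        contain(agent)
    print(agent)
```

Note that `restore` refuses an empty result set, so a test pipeline that never ran cannot be mistaken for one that passed. Duration is only one signal; the output-quality checks the lesson calls for would sit alongside `is_anomalous` rather than replace it.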