OC-301g · Module 3

On-Call for Agent Systems

On-call for agent systems differs from traditional on-call in one critical way: the system does not wait for the engineer. A web application that breaks stops serving requests until someone fixes it. An agent system that breaks may continue operating — producing wrong output, making bad decisions, or taking actions based on corrupted data. The on-call response must include not just fixing the problem but assessing the blast radius: what did the agent do while it was broken?

The on-call procedure for agent incidents has five steps:

Step one: contain. Pause the affected agent or quarantine its output. Do not let it continue operating while you investigate.
Step two: assess blast radius. What tasks did the agent complete since the anomaly began? Were any outputs delivered to external stakeholders? Were any downstream agents affected by the corrupted output?
Step three: remediate. Fix the root cause.
Step four: recover. Reprocess the affected tasks from a known-good state that predates the incident. Notify stakeholders if corrupted output was delivered.
Step five: document. Add the incident to the runbook and create a regression test.
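The blast-radius questions in step two can be answered mechanically if the agent's activity is logged. The following is a minimal sketch in Python, assuming a hypothetical log schema: `TaskRecord`, its fields, and `assess_blast_radius` are illustrative names, not part of any particular agent framework.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TaskRecord:
    """One completed task from an agent's activity log (hypothetical schema)."""
    task_id: str
    agent: str
    completed_at: datetime
    delivered_externally: bool = False
    # Names of downstream agents that consumed this task's output.
    consumed_by: list[str] = field(default_factory=list)


def assess_blast_radius(log, agent, anomaly_start):
    """Summarize what `agent` did from `anomaly_start` onward."""
    affected = [
        r for r in log
        if r.agent == agent and r.completed_at >= anomaly_start
    ]
    return {
        # Tasks completed since the anomaly began.
        "affected_tasks": [r.task_id for r in affected],
        # Outputs that reached external stakeholders.
        "external_deliveries": [
            r.task_id for r in affected if r.delivered_externally
        ],
        # Downstream agents that consumed possibly corrupted output.
        "downstream_agents": sorted(
            {name for r in affected for name in r.consumed_by}
        ),
    }


# Example data: the anomaly is assumed to have started at 10:00.
log = [
    TaskRecord("t1", "summarizer", datetime(2024, 1, 1, 9, 0)),
    TaskRecord("t2", "summarizer", datetime(2024, 1, 1, 10, 30),
               delivered_externally=True),
    TaskRecord("t3", "summarizer", datetime(2024, 1, 1, 11, 0),
               consumed_by=["router"]),
    TaskRecord("t4", "router", datetime(2024, 1, 1, 11, 5)),
]
report = assess_blast_radius(log, "summarizer", datetime(2024, 1, 1, 10, 0))
print(report)
```

In this example the report lists `t2` and `t3` as affected, flags `t2` for stakeholder notification, and names `router` as a downstream agent whose own output may need the same review.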

1. Contain Immediately: Pause the agent or quarantine its output queue. An agent producing bad output that reaches stakeholders causes more damage every minute it continues operating. Contain first, investigate second.
2. Assess Blast Radius: Review the agent's activity since the anomaly began: tasks completed, outputs delivered, decisions made, downstream agents affected. The blast radius determines the scope of the recovery effort.
3. Remediate and Recover: Fix the root cause. Reprocess affected tasks. Notify stakeholders if corrupted output was delivered externally. Document the incident and add a regression test to prevent recurrence.
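One way to make containment a single action is to route agent output through a gate that can hold it instead of delivering it. The sketch below assumes such a gate exists between the agent and its consumers. `OutputGate`, its methods, and the `is_valid` check are hypothetical names chosen for illustration.

```python
from collections import deque


class OutputGate:
    """Sits between an agent and its consumers (hypothetical component)."""

    def __init__(self, deliver):
        self._deliver = deliver    # callable that sends an output onward
        self.paused = False
        self.quarantine = deque()  # outputs held for review while paused

    def submit(self, output):
        """Called by the agent for every output it produces."""
        if self.paused:
            self.quarantine.append(output)  # contained: hold, do not deliver
        else:
            self._deliver(output)

    def contain(self):
        """Contain: stop delivering output."""
        self.paused = True

    def release(self, is_valid):
        """Recover: resume delivery and flush held outputs that pass review.

        Returns the rejected outputs so their tasks can be reprocessed.
        """
        self.paused = False
        rejected = []
        while self.quarantine:
            output = self.quarantine.popleft()
            if is_valid(output):
                self._deliver(output)
            else:
                rejected.append(output)
        return rejected


# Example: one output is delivered, then the gate is closed during an incident.
delivered = []
gate = OutputGate(delivered.append)
gate.submit("report-1")
gate.contain()
gate.submit("report-2 (corrupted)")
gate.submit("report-3")

# After the root cause is fixed, release with a review check.
rejected = gate.release(lambda output: "corrupted" not in output)
print(delivered, rejected)
```

The design choice here is to quarantine rather than discard: nothing reaches stakeholders while the gate is closed, but every held output is still available for review, and the rejected ones identify the tasks to reprocess.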