AT-301h · Module 2

Failure Isolation


When the trace identifies a suspect agent but the root cause is not immediately obvious, isolation testing confirms or eliminates the hypothesis. The technique: reproduce the scenario with the suspect agent operating in isolation — remove it from the team, feed it the same inputs, and observe whether it produces the same failure.

There are three isolation strategies.

Full Isolation: remove the agent from the team entirely and run it standalone with the recorded inputs. If it produces the same failure, the problem is within the agent (prompt, role contract, capability). If it succeeds, the problem is environmental (timing, concurrent load, upstream data).

Pairwise Isolation: run only the suspect agent and its immediate upstream neighbor. This tests whether the handoff between the two agents is the failure point.

Replay Isolation: record all messages from a failure incident and replay them through the suspect agent. This tests whether the failure is reproducible from the same inputs.
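
A minimal sketch of full isolation, assuming the suspect agent can be called as a plain Python function and that the inputs from the failure were recorded as a list. The names `run_full_isolation`, `IsolationResult`, and the `is_failure` predicate are hypothetical and not tied to any particular agent framework.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class IsolationResult:
    reproduced: bool    # True if the standalone agent failed again
    diagnosis: str      # "agent" or "environment"
    outputs: List[Any]  # one entry per recorded input


def run_full_isolation(
    agent: Callable[[Any], Any],
    recorded_inputs: List[Any],
    is_failure: Callable[[Any], bool],
) -> IsolationResult:
    """Run the suspect agent standalone on the recorded inputs."""
    outputs: List[Any] = []
    reproduced = False
    for item in recorded_inputs:
        try:
            out = agent(item)
        except Exception as exc:  # a crash counts as a reproduced failure
            out = exc
            reproduced = True
        else:
            if is_failure(out):
                reproduced = True
        outputs.append(out)
    # Same failure without the team: look inside the agent.
    # No failure: look at timing, load, or upstream data instead.
    diagnosis = "agent" if reproduced else "environment"
    return IsolationResult(reproduced, diagnosis, outputs)


# Hypothetical suspect agent: errors whenever the input lacks a "task" field.
def suspect_agent(message: dict) -> dict:
    return {"status": "ok" if "task" in message else "error"}


result = run_full_isolation(
    suspect_agent,
    recorded_inputs=[{"task": "summarize"}, {"payload": "no task field"}],
    is_failure=lambda out: out["status"] == "error",
)
print(result.diagnosis)
```

Keeping the failure check as a separate predicate lets the same harness be reused for different failure definitions (wrong output, timeout marker, schema violation) without changing the agent under test.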

Full isolation resolves the hypothesis in 78.34% of cases. Pairwise isolation catches handoff-layer issues that full isolation misses. Replay isolation catches timing-sensitive failures that neither static method catches.
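
The pairwise check can be sketched in the same style, assuming both agents are plain callables and the upstream agent's output is handed directly to the suspect agent. `run_pairwise_isolation` and the two toy agents are hypothetical illustrations, not a real framework API.

```python
from typing import Any, Callable, Tuple


def run_pairwise_isolation(
    upstream: Callable[[Any], Any],
    suspect: Callable[[Any], Any],
    trigger: Any,
    is_failure: Callable[[Any], bool],
) -> Tuple[bool, Any, Any]:
    """Feed the original trigger to the upstream agent, pass its output to
    the suspect agent, and report whether the failure reproduces."""
    handoff = upstream(trigger)
    try:
        output = suspect(handoff)
        failed = is_failure(output)
    except Exception as exc:  # a crash on the handoff counts as a failure
        output, failed = exc, True
    return failed, handoff, output


# Hypothetical pair: upstream emits "items", the suspect expects "records".
def upstream_agent(trigger: str) -> dict:
    return {"items": trigger.split(",")}


def suspect_agent(message: dict) -> int:
    return len(message["records"])  # raises KeyError on a schema mismatch


failed, handoff, output = run_pairwise_isolation(
    upstream_agent, suspect_agent, "a,b,c", is_failure=lambda out: out == 0
)
if failed:
    print("handoff is suspect; upstream sent keys:", sorted(handoff))
```

Returning the handoff message alongside the verdict makes it possible to inspect exactly what crossed the interface when the failure reproduces.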

Full Isolation: Remove the agent from the team. Feed it the recorded inputs from the failure. If it fails again, the problem is in the agent. If it succeeds, the problem is in the environment.
Pairwise Isolation: Run the suspect agent with only its upstream neighbor. Feed the upstream agent the original trigger. If the failure reproduces, the handoff interface is the issue. If it succeeds, the failure requires broader system interaction.
Replay Isolation: Record the full message chain from the failure incident. Replay it through the suspect agent at the same timing intervals. If the failure reproduces, it is deterministically caused by those inputs. If not, the failure is sensitive to timing or concurrent load.
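
Replaying at the original timing intervals could look like the sketch below. It assumes each recorded message carries its offset in seconds from the start of the incident; the `replay` function, the `RecordedMessage` shape, and `CountingAgent` are hypothetical. The `sleep` parameter is injectable so the gaps can be skipped or inspected in tests.

```python
import time
from typing import Any, Callable, List, Tuple

# Each recorded message is (seconds since incident start, payload).
RecordedMessage = Tuple[float, Any]


def replay(
    agent: Callable[[Any], Any],
    recording: List[RecordedMessage],
    is_failure: Callable[[Any], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Replay recorded messages with their original gaps.
    Returns True if any output is a failure or the agent raises."""
    previous_offset = recording[0][0] if recording else 0.0
    reproduced = False
    for offset, payload in recording:
        sleep(max(0.0, offset - previous_offset))  # preserve original spacing
        previous_offset = offset
        try:
            if is_failure(agent(payload)):
                reproduced = True
        except Exception:
            reproduced = True
    return reproduced


class CountingAgent:
    """Hypothetical stateful agent whose buffer overflows after two messages."""

    def __init__(self) -> None:
        self.buffer: List[Any] = []

    def __call__(self, payload: Any) -> str:
        self.buffer.append(payload)
        return "overflow" if len(self.buffer) > 2 else "ok"


recording = [(0.0, "a"), (0.01, "b"), (0.02, "c")]
reproduced = replay(CountingAgent(), recording, lambda out: out == "overflow")
print("caused by these inputs" if reproduced else "sensitive to timing or load")
```

A fresh agent instance is used for each replay so that state left over from an earlier run cannot mask or fake a reproduction.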