AT-301h · Module 1
Failure Classification
Before debugging, classify the failure. The classification determines the debugging method — and applying the wrong method wastes time investigating the wrong layer.
Classification matrix: Is the output wrong (factual error, format error, quality below threshold) or is there no output (timeout, silent failure, deadlock)? Wrong output → investigate the data chain: what inputs did the failing agent receive, and were they correct? No output → investigate the operational chain: is the agent running, is it receiving messages, is it stuck?
Within wrong-output failures: is the error consistent (every execution fails the same way) or intermittent (sometimes correct, sometimes wrong)? Consistent errors point to systematic causes — bad prompts, incorrect role contracts, schema mismatches. Intermittent errors point to state-dependent causes — race conditions, stale context, timing-sensitive handoffs.
This two-axis classification (output type × consistency) narrows the search space from "anything could be wrong" to a specific failure category with a known debugging protocol.
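The classification can be sketched as a small lookup. This is an illustrative sketch only: the names `OutputType`, `Consistency`, and `select_protocol` are hypothetical and not part of any framework. The returned checks simply restate the categories described above.

```python
from enum import Enum
from typing import Optional


class OutputType(Enum):
    WRONG_OUTPUT = "wrong output"  # factual error, format error, low quality
    NO_OUTPUT = "no output"        # timeout, silent failure, deadlock


class Consistency(Enum):
    CONSISTENT = "consistent"      # every execution fails the same way
    INTERMITTENT = "intermittent"  # sometimes correct, sometimes wrong


def select_protocol(output_type: OutputType,
                    consistency: Optional[Consistency] = None) -> dict:
    """Map a classified failure to the chain to investigate (hypothetical helper)."""
    if output_type is OutputType.NO_OUTPUT:
        # No output: look at the operational chain.
        return {
            "chain": "operational",
            "check": ["is the agent running",
                      "is it receiving messages",
                      "is it stuck"],
        }
    # Wrong output: the consistency axis is required to pick the likely causes.
    if consistency is None:
        raise ValueError("wrong-output failures also need a consistency classification")
    if consistency is Consistency.CONSISTENT:
        causes = ["bad prompts", "incorrect role contracts", "schema mismatches"]
    else:
        causes = ["race conditions", "stale context", "timing-sensitive handoffs"]
    return {"chain": "data", "check": causes}


if __name__ == "__main__":
    print(select_protocol(OutputType.WRONG_OUTPUT, Consistency.INTERMITTENT))
```

Raising an error when the consistency axis is missing is a design choice in this sketch: it forces the second classification step before any wrong-output investigation starts.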
Do This
- Classify before investigating — wrong method wastes more time than wrong hypothesis
- Check output type first: wrong output vs. no output tells you whether to search the data chain or the operational chain
- Check consistency second: consistent vs. intermittent selects the debugging protocol
Avoid This
- Start reading logs from the beginning — in a 20-agent system, the logs are overwhelming
- Assume the failing agent is the root cause — 62.17% of failures originate upstream
- Debug intermittent failures with single-execution analysis — you need a pattern across multiple runs
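The last point can be illustrated with a minimal sketch that decides the consistency axis from several runs rather than one. The function name `classify_consistency` and the representation of a run as a boolean (`True` meaning the output was correct) are assumptions made for this example. It also simplifies "fails the same way" to "fails on every run".

```python
from typing import List


def classify_consistency(run_passed: List[bool]) -> str:
    """Classify consistency from repeated runs of the same task (hypothetical helper).

    run_passed holds one entry per execution: True if the output was correct.
    """
    if len(run_passed) < 2:
        # A single execution cannot distinguish consistent from intermittent.
        raise ValueError("need at least two runs to judge consistency")
    failures = run_passed.count(False)
    if failures == 0:
        return "not reproduced"
    if failures == len(run_passed):
        return "consistent"
    return "intermittent"


if __name__ == "__main__":
    print(classify_consistency([True, False, True, True]))
```

How many runs are enough depends on how rare the failure is. The sketch only enforces the minimum of two.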