AT-301h · Module 1

Failure Classification


Before debugging, classify the failure. The classification determines the debugging method — and applying the wrong method wastes time investigating the wrong layer.

Classification matrix: Is the output wrong (factual error, format error, quality below threshold) or is there no output (timeout, silent failure, deadlock)? Wrong output → investigate the data chain: what inputs did the failing agent receive, and were they correct? No output → investigate the operational chain: is the agent running, is it receiving messages, is it stuck?

Within wrong-output failures: is the error consistent (every execution fails the same way) or intermittent (sometimes correct, sometimes wrong)? Consistent errors point to systematic causes — bad prompts, incorrect role contracts, schema mismatches. Intermittent errors point to state-dependent causes — race conditions, stale context, timing-sensitive handoffs.
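The consistency check can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function name `classify_consistency` and the idea of recording one short failure label per run (with `None` for a correct run) are assumptions made for the example.

```python
from typing import Optional, Sequence


def classify_consistency(signatures: Sequence[Optional[str]]) -> str:
    """Classify repeated runs of the same task.

    signatures holds one entry per run: None if the run was correct,
    otherwise a short label describing how it failed (hypothetical labels).
    """
    if len(signatures) < 2:
        # A single run cannot distinguish consistent from intermittent.
        raise ValueError("need at least two runs to judge consistency")

    failures = [s for s in signatures if s is not None]
    if not failures:
        return "no_failure"

    # Consistent: every run failed, and every run failed the same way.
    if len(failures) == len(signatures) and len(set(failures)) == 1:
        return "consistent"

    # Assumption: a mix of passes and failures, or failures that differ
    # between runs, is treated as intermittent.
    return "intermittent"
```

For example, three runs that all fail with the same label classify as consistent, while one failure among several correct runs classifies as intermittent.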

This two-axis classification (output type × consistency) narrows the search space from "anything could be wrong" to a specific failure category with a known debugging protocol.
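As a rough sketch, the two axes can be written as a small lookup that returns where to start investigating. The enum names and the `starting_point` function are hypothetical, and the returned strings only restate the categories described above.

```python
from enum import Enum
from typing import Optional


class OutputType(Enum):
    WRONG_OUTPUT = "wrong output"  # factual, format, or quality error
    NO_OUTPUT = "no output"        # timeout, silent failure, deadlock


class Consistency(Enum):
    CONSISTENT = "consistent"      # every execution fails the same way
    INTERMITTENT = "intermittent"  # sometimes correct, sometimes wrong


def starting_point(output_type: OutputType,
                   consistency: Optional[Consistency] = None) -> str:
    """Map a classified failure to the first thing to investigate."""
    if output_type is OutputType.NO_OUTPUT:
        return "operational chain: is the agent running, receiving messages, stuck?"
    if consistency is Consistency.CONSISTENT:
        return "data chain, systematic causes: prompts, role contracts, schemas"
    if consistency is Consistency.INTERMITTENT:
        return "data chain, state-dependent causes: races, stale context, handoff timing"
    # Wrong output, consistency not yet established.
    return "data chain: what inputs did the failing agent receive?"


print(starting_point(OutputType.WRONG_OUTPUT, Consistency.INTERMITTENT))
```

In this sketch, consistency only matters for wrong-output failures, which mirrors how the text introduces the second axis.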

Do This

Classify before investigating: the wrong method wastes more time than a wrong hypothesis
Check output type first: wrong output vs. no output decides whether you trace the data chain or the operational chain
Check consistency second: consistent vs. intermittent selects the debugging protocol

Avoid This

Start reading logs from the beginning — in a 20-agent system, the logs are overwhelming
Assume the failing agent is the root cause — 62.17% of failures originate upstream
Debug intermittent failures with single-execution analysis — you need a pattern across multiple runs
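To illustrate the point about upstream causes, here is a minimal sketch of walking the data chain from the first agent toward the failing one and stopping at the first invalid output. It assumes a simple linear chain and a caller-supplied validity check; `first_bad_stage`, the stage names, and the `is_valid` rule are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def first_bad_stage(
    stages: Sequence[Tuple[str, object]],
    is_valid: Callable[[object], bool],
) -> Optional[str]:
    """Return the name of the earliest stage whose output is invalid.

    stages is ordered from the first agent in the chain to the agent
    where the failure was observed; each entry is (agent name, output).
    """
    for name, output in stages:
        if not is_valid(output):
            return name
    return None  # every recorded output passed the check


# Hypothetical chain: the failure shows up at "writer", but the
# first invalid output was produced upstream.
chain = [
    ("planner", {"task": "summarize"}),
    ("retriever", {}),
    ("writer", {}),
]
print(first_bad_stage(chain, lambda out: "task" in out))
```

The agent where the failure is observed is only the end of the chain; the earliest invalid output is the better place to start looking.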