AT-201a · Module 2

Error Propagation & Recovery

4 min read

Agents fail. Not sometimes. Regularly. A research agent returns empty results because the web search did not find relevant pages. A drafting agent produces output that does not match the specified format. A review agent gets stuck in a loop repeating the same feedback. A parallel agent silently drops a required field from its output. Failure is not the exception in multi-agent systems. Failure is a design parameter.

The question is not whether agents will fail. The question is what happens when they do. In a system without error recovery, one agent's failure cascades through the entire pipeline. Empty research feeds into the drafter, which produces a draft built on nothing, which the reviewer approves because it is well-written nonsense. The final deliverable looks professional and contains zero substance. Confident nonsense is the worst failure mode because it passes every surface-level quality check.

Error recovery operates at three levels. Level one: retry. The simplest recovery. An agent failed its task — try again, possibly with a refined prompt. "The research agent returned no results for Company X. Retrying with alternative search terms and broader scope." Retry works for transient failures: web search timeouts, context window exhaustion, ambiguous prompt interpretation. If the retry also fails, escalate to level two.
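
Level-one retry can be sketched as a small dispatch wrapper. This is a minimal illustration, not a real framework: `dispatch_with_retry`, `refine_prompt`, and `AgentError` are assumed names, and the agent is modeled as a plain callable that returns results or raises.

```python
import time

class AgentError(Exception):
    """Hypothetical: raised when an agent fails to produce usable output."""

def refine_prompt(prompt: str, attempt: int) -> str:
    """Change the prompt on each retry instead of repeating it verbatim."""
    if attempt == 0:
        return prompt
    return f"{prompt} (alternative search terms, broader scope, attempt {attempt + 1})"

def dispatch_with_retry(agent, prompt: str, max_retries: int = 2, base_delay: float = 1.0):
    """Level one: retry with a refined prompt, backing off between attempts."""
    for attempt in range(max_retries + 1):
        try:
            result = agent(refine_prompt(prompt, attempt))
            if result:  # empty results count as failure, not success
                return result
        except AgentError:
            pass
        if attempt < max_retries:
            time.sleep(base_delay * 2 ** attempt)  # simple exponential backoff
    raise AgentError(f"retry exhausted after {max_retries + 1} attempts: {prompt}")
```

Note that an empty result is treated the same as an exception: both trigger a retry, which is what catches the silent "no results" failure described above.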

Level two: fallback. The designated agent cannot complete the task. Route it to a different agent or a different approach. "The specialized research agent failed. Dispatching a general-purpose agent with broader search capabilities." Fallback works when the failure is specific to the agent's approach, not the task itself. If the fallback also fails, escalate to level three.
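
Level two extends the same pattern: if the specialist fails or returns nothing, route the task to a different agent. Again a sketch under the same assumptions (agents as callables, a hypothetical `AgentError`), not a definitive implementation.

```python
class AgentError(Exception):
    """Hypothetical: raised when an agent fails to produce usable output."""

def dispatch_with_fallback(primary, fallback, task: str):
    """Level two: if the specialized agent fails, route to a general-purpose one.

    Returns (result, source) so the coordinator can log which path succeeded.
    """
    try:
        result = primary(task)
        if result:
            return result, "primary"
    except AgentError:
        pass
    # Failure was specific to the primary agent's approach; try a broader one.
    result = fallback(task)
    if not result:
        raise AgentError(f"both primary and fallback failed for: {task}")
    return result, "fallback"
```

Returning the source alongside the result lets downstream stages know whether they are consuming specialist or general-purpose output.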

Level three: graceful degradation. The task genuinely cannot be completed. The coordinator acknowledges the gap, documents what was attempted, and proceeds with partial results. "Competitor analysis for Company X is unavailable — two research attempts and one fallback returned no data. Proceeding with analysis of the remaining four competitors. Company X flagged for manual research." Graceful degradation is not failure. It is honest reporting of what the system could and could not accomplish.
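
Graceful degradation mostly comes down to a result type that can represent gaps honestly. The shape below is an assumption for illustration: `PipelineResult` and `analyze_competitors` are hypothetical names, and `research_fn` stands in for the full retry-plus-fallback chain.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineResult:
    """Hypothetical container for partial results plus documented gaps."""
    completed: dict = field(default_factory=dict)
    gaps: list = field(default_factory=list)  # items flagged for manual follow-up

def analyze_competitors(research_fn, companies):
    """Level three: proceed with partial results and document each gap."""
    result = PipelineResult()
    for company in companies:
        data = research_fn(company)  # assumed to already include retry/fallback
        if data:
            result.completed[company] = data
        else:
            result.gaps.append(
                f"{company}: no data after retry and fallback; flagged for manual research"
            )
    return result
```

The key design choice is that a gap is recorded, not swallowed: the final deliverable can state exactly which competitors are covered and which need human follow-up.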

Do This

  • Build retry logic into every agent dispatch — transient failures are common
  • Define fallback agents for critical pipeline stages
  • Degrade gracefully: proceed with partial results and flag gaps for human review
  • Log every failure, retry, and fallback for post-session analysis
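
The logging bullet above can be satisfied with something as simple as an append-only event list. A minimal sketch: `record_event`, `summarize`, and the field names are assumptions, not a real logging API.

```python
import time

def record_event(log: list, agent: str, event: str, detail: str) -> None:
    """Append one failure/retry/fallback event with a timestamp."""
    log.append({
        "ts": time.time(),
        "agent": agent,
        "event": event,   # e.g. "failure", "retry", "fallback", "degraded"
        "detail": detail,
    })

def summarize(log: list) -> dict:
    """Count events by type for a quick post-session report."""
    counts = {}
    for entry in log:
        counts[entry["event"]] = counts.get(entry["event"], 0) + 1
    return counts
```

Even this much is enough to answer the post-session questions that matter: which agents failed, how often retries were needed, and how many tasks ended in degradation rather than completion.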

Avoid This

  • Let empty results cascade through the pipeline — catch failures at the source
  • Retry the same prompt without modification — change parameters, broaden scope, or rephrase
  • Treat partial results as total failure — four of five competitors is still valuable analysis
  • Hide failures from the user — transparent error reporting builds trust