CDX-201c · Module 3
Error Recovery & Retry Strategies
4 min read
In multi-agent pipelines, failures are not exceptional — they are expected. An agent can fail because of a model error, a timeout, a dependency issue, a sandbox restriction, or simply because the task was poorly scoped. The question is not whether agents will fail but how the pipeline recovers when they do.
Three recovery patterns handle most failure scenarios:
- Retry with backoff: the simplest pattern. Resubmit the failed task with the same prompt, optionally at a higher reasoning effort or with a different model.
- Context-enriched retry: add the failure reason to the prompt ("the previous attempt failed because X — avoid that approach") so the agent learns from the failure.
- Fallback escalation: if an agent fails repeatedly, escalate to a more capable model or to a human reviewer.
The helper below combines the first two patterns.
```python
from agents import Agent, Runner
import time


def run_with_retry(agent, task, max_retries=3):
    """Run an agent task with exponential backoff and context enrichment."""
    context = ""
    for attempt in range(max_retries):
        try:
            prompt = f"{task}\n\n{context}" if context else task
            # Runner.run is async; use run_sync so exceptions surface here
            return Runner.run_sync(agent, prompt)
        except Exception as e:
            context = (
                f"Previous attempt {attempt + 1} failed: {e}. "
                f"Avoid the approach that caused this failure."
            )
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # back off 1s, 2s before retrying
    # Automated retries exhausted -- escalate to a human
    raise RuntimeError(
        f"Agent {agent.name} failed after {max_retries} attempts. "
        f"Last error: {context}"
    )
```
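The third pattern, fallback escalation, can be sketched independently of any agent SDK. In this minimal sketch the tiers are plain callables standing in for models of increasing capability; the `run_with_escalation` name and its interface are illustrative assumptions, not part of any library:

```python
def run_with_escalation(task, tiers, attempts_per_tier=2):
    """Try each tier in order; escalate to the next when one is exhausted.

    tiers: list of (name, run) pairs, ordered from cheapest to most capable.
    """
    errors = []
    for name, run in tiers:
        for attempt in range(attempts_per_tier):
            try:
                return run(task)
            except Exception as e:
                errors.append(f"{name} attempt {attempt + 1}: {e}")
    # Every automated tier failed -- hand off to a human reviewer
    raise RuntimeError("Escalate to human. History: " + "; ".join(errors))
```

In practice each `run` callable would wrap a model call; ordering the tiers from cheapest to most capable keeps routine failures from immediately consuming the budget of the strongest model.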
Do This
- Implement context-enriched retry for all automated agent tasks
- Set a maximum retry count — infinite retries waste budget on unsolvable tasks
- Log every failure with full context for post-mortem analysis
- Escalate to a human when automated recovery is exhausted
Avoid This
- Retry indefinitely — three failures usually indicate a scoping or configuration issue
- Use blind retry without adding failure context — the agent will make the same mistake
- Swallow errors silently — failed tasks that nobody notices are the most expensive failures
- Assume model errors are transient — check if the task itself is malformed before retrying
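The last point can be made concrete with a cheap pre-flight check that runs before any retry loop. This is a sketch; the `validate_task` name and the specific limits are illustrative assumptions to tune per pipeline:

```python
def validate_task(task, max_chars=50_000):
    """Reject obviously malformed tasks before spending retries on them.

    The checks and the character limit are illustrative, not prescriptive.
    """
    if not isinstance(task, str) or not task.strip():
        raise ValueError("task is empty or not a string -- retrying cannot fix this")
    if len(task) > max_chars:
        raise ValueError(f"task exceeds {max_chars} chars -- split it before dispatch")
    return task
```

A check like this turns "the model keeps failing" into "the task was never runnable," which is exactly the distinction the retry budget depends on.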