CDX-201c · Module 3
Error Recovery & Retry Strategies
4 min read
In multi-agent pipelines, failures are not exceptional — they are expected. An agent can fail because of a model error, a timeout, a dependency issue, a sandbox restriction, or simply because the task was poorly scoped. The question is not whether agents will fail but how the pipeline recovers when they do.
Three recovery patterns handle most failure scenarios:
- Retry with backoff: the simplest pattern. Resubmit the failed task with the same prompt, optionally at a higher reasoning effort or with a different model.
- Context-enriched retry: add the failure reason to the prompt ("the previous attempt failed because X — avoid that approach") so the agent learns from the failure.
- Fallback escalation: if an agent fails repeatedly, escalate to a more capable model or to a human reviewer.
The helper below combines the first two patterns.
```python
from agents import Agent, Runner
import time


def run_with_retry(agent, task, max_retries=3):
    """Run an agent task with exponential backoff and context enrichment."""
    context = ""
    for attempt in range(max_retries):
        try:
            prompt = f"{task}\n\n{context}" if context else task
            # Runner.run is async; use run_sync so exceptions surface here
            return Runner.run_sync(agent, prompt)
        except Exception as e:
            context = (
                f"Previous attempt {attempt + 1} failed: {e}. "
                f"Avoid the approach that caused this failure."
            )
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # back off 1s, 2s before retrying
    # Automated retries exhausted -- escalate to a human
    raise RuntimeError(
        f"Agent {agent.name} failed after {max_retries} attempts. "
        f"Last error: {context}"
    )
```
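The third pattern, fallback escalation, can be sketched independently of any agent SDK. In this minimal sketch the tiers are plain callables standing in for models of increasing capability; the `run_with_escalation` name and its interface are illustrative assumptions, not part of any library:

```python
def run_with_escalation(task, tiers, attempts_per_tier=2):
    """Try each tier in order; escalate to the next when one is exhausted.

    tiers: list of (name, run) pairs, ordered from cheapest to most capable.
    """
    errors = []
    for name, run in tiers:
        for attempt in range(attempts_per_tier):
            try:
                return run(task)
            except Exception as e:
                errors.append(f"{name} attempt {attempt + 1}: {e}")
    # Every automated tier failed -- hand off to a human reviewer
    raise RuntimeError("Escalate to human. History: " + "; ".join(errors))
```

In practice each `run` callable would wrap a model call; ordering the tiers from cheapest to most capable keeps routine failures from immediately consuming the budget of the strongest model.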
Do This
- Implement context-enriched retry for all automated agent tasks
- Set a maximum retry count — infinite retries waste budget on unsolvable tasks
- Log every failure with full context for post-mortem analysis
- Escalate to a human when automated recovery is exhausted
Avoid This
- Retry indefinitely — three failures usually indicate a scoping or configuration issue
- Use blind retry without adding failure context — the agent will make the same mistake
- Swallow errors silently — failed tasks that nobody notices are the most expensive failures
- Assume model errors are transient — check if the task itself is malformed before retrying
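The last point can be made concrete with a cheap pre-flight check that runs before any retry loop. This is a sketch; the `validate_task` name and the specific limits are illustrative assumptions to tune per pipeline:

```python
def validate_task(task, max_chars=50_000):
    """Reject obviously malformed tasks before spending retries on them.

    The checks and the character limit are illustrative, not prescriptive.
    """
    if not isinstance(task, str) or not task.strip():
        raise ValueError("task is empty or not a string -- retrying cannot fix this")
    if len(task) > max_chars:
        raise ValueError(f"task exceeds {max_chars} chars -- split it before dispatch")
    return task
```

A check like this turns "the model keeps failing" into "the task was never runnable," which is exactly the distinction the retry budget depends on.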