CDX-301h · Module 2

Retry Policies & Budgets


Retry policies define how the supervisor responds when a worker fails. The naive approach, retrying the same prompt until it works, is expensive and often ineffective. Expert retry policies distinguish between transient failures (model timeout, rate limit, sandbox error) and persistent failures (task too complex, incorrect decomposition, missing context). Transient failures warrant automatic retry with backoff. Persistent failures warrant context enrichment, model escalation, or task re-decomposition. A third category, fatal failures (sandbox violation, auth error), is never retried.

Retry budgets prevent runaway costs. Each task gets a maximum retry count (typically 2-3) and a maximum token budget (the original task cost multiplied by a factor, typically 3x). If the task exhausts either budget without succeeding, the supervisor escalates rather than retrying. The budget should account for increasing costs: context-enriched retries are more expensive than the original attempt because the failure context adds tokens.
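To make the budget arithmetic concrete, here is a minimal sketch of how many context-enriched retries fit under a cost ceiling. The function name `retries_within_budget` and the `growth` parameter (how much more each retry costs than the previous attempt) are assumptions for illustration, as is the choice to count the original attempt toward the total. Unlike a check on cost already spent, this sketch asks whether the next attempt would still fit.

```python
def retries_within_budget(original_cost: float,
                          growth: float = 1.5,
                          max_cost_multiplier: float = 3.0,
                          max_retries: int = 3) -> int:
    """Return how many retries fit before either budget runs out.

    Assumptions (not from the source): each retry costs `growth` times
    the previous attempt, and the original attempt counts toward the total.
    """
    ceiling = original_cost * max_cost_multiplier
    total = original_cost               # the original attempt is already spent
    next_cost = original_cost * growth  # the first retry carries failure context
    retries = 0
    # Stop when the retry count is used up or the next attempt
    # would push total spend past the ceiling.
    while retries < max_retries and total + next_cost <= ceiling:
        total += next_cost
        retries += 1
        next_cost *= growth
    return retries
```

With a 1,000-token original attempt and retries that each cost 1.5x the previous one, only one retry fits under a 3x ceiling, even though the retry count alone would allow three. In this scenario the cost ceiling, not the retry count, is the binding limit.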

from dataclasses import dataclass, field
from enum import Enum

class FailureType(Enum):
    TRANSIENT = "transient"     # timeout, rate limit
    PERSISTENT = "persistent"   # wrong approach, missing context
    FATAL = "fatal"             # sandbox violation, auth error

@dataclass
class RetryPolicy:
    max_retries: int = 3
    max_cost_multiplier: float = 3.0
    backoff_seconds: list[int] = field(
        default_factory=lambda: [2, 5, 15]
    )

    def should_retry(self, failure_type: FailureType,
                     attempt: int, total_cost: float,
                     original_cost: float) -> bool:
        # attempt is the number of retries already made (0 before the first retry)
        if failure_type == FailureType.FATAL:
            return False  # Never retry fatal errors
        if attempt >= self.max_retries:
            return False  # Budget exhausted
        if total_cost > original_cost * self.max_cost_multiplier:
            return False  # Cost ceiling hit
        return True

    def get_backoff(self, attempt: int) -> int:
        idx = min(attempt, len(self.backoff_seconds) - 1)
        return self.backoff_seconds[idx]
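The policy above takes a `FailureType` as input but does not show where that classification comes from. One possible sketch maps exceptions to failure types, and failure types to a next action. The exception classes `SandboxViolation` and `MissingContextError`, the `Action` enum, and both helper functions are hypothetical; `FailureType` is repeated only so the block runs on its own.

```python
from enum import Enum


class FailureType(Enum):
    TRANSIENT = "transient"
    PERSISTENT = "persistent"
    FATAL = "fatal"


class Action(Enum):  # hypothetical set of supervisor responses
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    ENRICH_CONTEXT = "enrich_context"  # or escalate the model / re-decompose
    ABORT = "abort"


class SandboxViolation(Exception):
    """Hypothetical: the worker broke a sandbox rule."""


class MissingContextError(Exception):
    """Hypothetical: the worker lacked information it needed."""


def classify_failure(exc: Exception) -> FailureType:
    # The built-in TimeoutError and ConnectionError stand in for
    # model timeouts and rate limits in this sketch.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return FailureType.TRANSIENT
    # PermissionError stands in for an auth error.
    if isinstance(exc, (SandboxViolation, PermissionError)):
        return FailureType.FATAL
    # Assumption: anything else, including MissingContextError, is
    # persistent, so it is never retried blindly with the same prompt.
    return FailureType.PERSISTENT


def choose_action(failure_type: FailureType) -> Action:
    # One strategy per failure type, following the text above.
    return {
        FailureType.TRANSIENT: Action.RETRY_WITH_BACKOFF,
        FailureType.PERSISTENT: Action.ENRICH_CONTEXT,
        FailureType.FATAL: Action.ABORT,
    }[failure_type]
```

Treating unknown errors as persistent is a design choice in this sketch: it favors changing the approach over repeating an attempt that may fail the same way.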

Classify the failure. Before retrying, determine if the failure is transient (retry will likely succeed), persistent (needs a different approach), or fatal (cannot be retried). This classification drives the retry strategy.
Set explicit budgets. Define max retries (2-3) and max cost multiplier (3x original) for each task. Publish these limits in the pipeline configuration so the supervisor enforces them consistently.
Log every retry. Record the failure type, retry strategy, and outcome. This data reveals patterns: if the same task type fails repeatedly, the decomposition or prompt needs adjustment.
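The logging step can be sketched as a list of small records plus a helper that surfaces task types with repeated failed retries. This assumes an in-memory log; the `RetryRecord` fields, the `task_type` labels, and the `failure_hotspots` helper are illustrative names, not part of the module's code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryRecord:
    task_type: str     # hypothetical label, e.g. "refactor"
    failure_type: str  # "transient", "persistent", or "fatal"
    strategy: str      # what the supervisor tried next
    succeeded: bool    # outcome of that retry


def failure_hotspots(records: list[RetryRecord],
                     threshold: int = 2) -> list[str]:
    """Return task types whose failed retries reach `threshold`."""
    failed = Counter(r.task_type for r in records if not r.succeeded)
    return sorted(t for t, n in failed.items() if n >= threshold)


# Made-up sample entries, for illustration only.
log = [
    RetryRecord("refactor", "persistent", "enrich_context", False),
    RetryRecord("refactor", "persistent", "escalate_model", False),
    RetryRecord("unit_test", "transient", "backoff", True),
]
```

Calling `failure_hotspots(log)` on the sample entries flags `"refactor"`, the kind of signal that suggests revisiting the decomposition or prompt for that task type.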