CDX-301g · Module 3

Rollback on Failure

3 min read

Rollback is the safety net for multi-agent pipelines. When a quality gate fails and context-enriched retry cannot fix the issue, the pipeline must revert to a known-good state. Without rollback, a failed pipeline leaves the codebase in a partially modified state — some agents' changes applied, others not — that is worse than the original state. Rollback strategies must be designed into the pipeline from the start, not bolted on after the first failure.

Git provides the natural rollback mechanism for code-producing agents. Each agent works on a branch, and rollback means deleting or resetting that branch. For the overall pipeline, a pre-pipeline tag or commit hash marks the rollback point. Three rollback scopes cover production needs: agent-level rollback (reset one agent's branch, re-run that agent), pipeline-level rollback (reset everything to the pre-pipeline state, start over), and partial rollback (keep successful agents' work, re-run only the failed agents with adjusted prompts).

import subprocess
from dataclasses import dataclass

@dataclass
class PipelineCheckpoint:
    commit_hash: str
    agent_branches: dict[str, str]  # agent_name -> branch_name

def create_checkpoint() -> PipelineCheckpoint:
    """Capture the current state before pipeline execution."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True
    )
    return PipelineCheckpoint(
        commit_hash=result.stdout.strip(),
        agent_branches={},
    )

def rollback_agent(checkpoint: PipelineCheckpoint, agent: str):
    """Rollback a single agent's branch."""
    branch = checkpoint.agent_branches.get(agent)
    if branch:
        subprocess.run(["git", "branch", "-D", branch])
        print(f"Rolled back {agent}: deleted branch {branch}")

def rollback_pipeline(checkpoint: PipelineCheckpoint):
    """Full pipeline rollback to pre-execution state."""
    subprocess.run(["git", "reset", "--hard", checkpoint.commit_hash])
    for agent, branch in checkpoint.agent_branches.items():
        subprocess.run(["git", "branch", "-D", branch],
                       capture_output=True)  # ignore if missing
    print(f"Pipeline rolled back to {checkpoint.commit_hash}")

Do This

Create a checkpoint (tag or commit hash) before every pipeline run
Design for partial rollback — successful agents should produce independently valid output
Automate rollback as part of the pipeline definition, not as a manual emergency procedure
Log the rollback reason for post-mortem analysis — repeated rollbacks indicate a systemic issue

Avoid This

Run pipelines without a rollback plan — the first unrecoverable failure will cost hours of manual cleanup
Assume "git stash" is a rollback strategy — it does not handle multi-branch agent workflows
Roll back the entire pipeline when one agent fails — partial rollback preserves successful work
Ignore rollback frequency — if you roll back more than 20% of pipeline runs, the decomposition needs work