CDX-301g · Module 3
Rollback on Failure
3 min read
Rollback is the safety net for multi-agent pipelines. When a quality gate fails and context-enriched retry cannot fix the issue, the pipeline must revert to a known-good state. Without rollback, a failed pipeline leaves the codebase in a partially modified state — some agents' changes applied, others not — that is worse than the original state. Rollback strategies must be designed into the pipeline from the start, not bolted on after the first failure.
Git provides the natural rollback mechanism for code-producing agents. Each agent works on a branch, and rollback means deleting or resetting that branch. For the overall pipeline, a pre-pipeline tag or commit hash marks the rollback point. Three rollback scopes cover production needs: agent-level rollback (reset one agent's branch, re-run that agent), pipeline-level rollback (reset everything to the pre-pipeline state, start over), and partial rollback (keep successful agents' work, re-run only the failed agents with adjusted prompts).
import subprocess
from dataclasses import dataclass
@dataclass
class PipelineCheckpoint:
commit_hash: str
agent_branches: dict[str, str] # agent_name -> branch_name
def create_checkpoint() -> PipelineCheckpoint:
"""Capture the current state before pipeline execution."""
result = subprocess.run(
["git", "rev-parse", "HEAD"],
capture_output=True, text=True
)
return PipelineCheckpoint(
commit_hash=result.stdout.strip(),
agent_branches={},
)
def rollback_agent(checkpoint: PipelineCheckpoint, agent: str):
"""Rollback a single agent's branch."""
branch = checkpoint.agent_branches.get(agent)
if branch:
subprocess.run(["git", "branch", "-D", branch])
print(f"Rolled back {agent}: deleted branch {branch}")
def rollback_pipeline(checkpoint: PipelineCheckpoint):
"""Full pipeline rollback to pre-execution state."""
subprocess.run(["git", "reset", "--hard", checkpoint.commit_hash])
for agent, branch in checkpoint.agent_branches.items():
subprocess.run(["git", "branch", "-D", branch],
capture_output=True) # ignore if missing
print(f"Pipeline rolled back to {checkpoint.commit_hash}")
Do This
- Create a checkpoint (tag or commit hash) before every pipeline run
- Design for partial rollback — successful agents should produce independently valid output
- Automate rollback as part of the pipeline definition, not as a manual emergency procedure
- Log the rollback reason for post-mortem analysis — repeated rollbacks indicate a systemic issue
Avoid This
- Run pipelines without a rollback plan — the first unrecoverable failure will cost hours of manual cleanup
- Assume "git stash" is a rollback strategy — it does not handle multi-branch agent workflows
- Roll back the entire pipeline when one agent fails — partial rollback preserves successful work
- Ignore rollback frequency — if you roll back more than 20% of pipeline runs, the decomposition needs work