MP-301i · Module 2

Rollback Strategies & Post-Mortems

3 min read

Rollback is the fastest remediation for deployment-induced incidents. The ability to roll back an MCP server to the previous version in under 2 minutes is a non-negotiable production capability. This requires: immutable deployment artifacts (container images tagged by version, not "latest"), a deployment system that keeps the previous version available, and a one-command rollback mechanism. Blue-green deployments give you instant rollback — traffic switches from the new (green) version back to the old (blue) version by updating the load balancer. The old version is still running, warmed up, and serving traffic within seconds.

MCP rollbacks have a session continuity challenge. Active sessions were negotiated with the new server version — they may have used new capabilities, new tools, or new protocol features. Rolling back to the old version may invalidate these sessions. The safe approach: drain active sessions before rollback (reject new connections, let in-flight requests complete, send close notifications), then switch traffic to the old version. If the incident is severe enough that draining is too slow, accept the session disruption and switch immediately — clients will reconnect and re-negotiate.

Post-mortems close the incident lifecycle. A good post-mortem documents: the timeline (when did the incident start, when was it detected, when was it resolved), the root cause (not the symptom, the actual cause), the contributing factors (what made the incident worse or harder to resolve), the remediation (what fixed it), and the action items (what changes will prevent recurrence). Post-mortems must be blameless — focus on systems and processes, not individuals. "The deploy pipeline did not run integration tests" is a system problem. "Bob forgot to run tests" is blame that discourages reporting.

Do This

Maintain immutable, versioned deployment artifacts — container images, not filesystem deploys
Keep the previous version running and warm during and after deployment
Drain active sessions before rollback when the incident severity allows it
Write blameless post-mortems focused on systems and processes, not individuals

Avoid This

Deploy by mutating the running instance — rollback requires a full redeploy
Delete the old version after deployment — rollback becomes a rebuild
Hard-cut traffic during rollback without draining — every active session breaks simultaneously
Skip the post-mortem because "we fixed it" — the same incident will recur without systemic changes