MP-301c · Module 3

Disaster Recovery

3 min read

Disaster recovery for MCP servers covers two scenarios: infrastructure failure (server goes down, database corrupted, cloud region outage) and data corruption (bad deployment writes invalid data, tool handler corrupts shared state). For infrastructure failure, the recovery plan is redundancy: multiple server instances behind a load balancer, database replicas in different availability zones, and automated failover that promotes a replica when the primary fails. For data corruption, the recovery plan is backups plus rollback: regular state snapshots, point-in-time recovery, and a tested procedure for restoring from backup.
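
For the backup-plus-rollback path, the core mechanic can be sketched in a few lines. This is a minimal in-memory illustration, not a real backup system: the `Snapshot` shape and the rolling checksum (standing in for a real hash like SHA-256) are assumptions for the example.

```typescript
// Hypothetical sketch: snapshot-and-restore with an integrity check,
// so a corrupted backup is rejected instead of silently restored.
interface Snapshot<T> {
  takenAt: number;
  data: string;      // serialized state
  checksum: number;  // integrity check over the serialized bytes
}

// Cheap rolling checksum — a stand-in for a real hash like SHA-256.
function checksum(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) >>> 0;
  return h;
}

function takeSnapshot<T>(state: T): Snapshot<T> {
  const data = JSON.stringify(state);
  return { takenAt: Date.now(), data, checksum: checksum(data) };
}

function restore<T>(snap: Snapshot<T>): T {
  if (checksum(snap.data) !== snap.checksum) {
    throw new Error("Snapshot corrupted; refusing to restore");
  }
  return JSON.parse(snap.data) as T;
}
```

The same shape applies when the snapshot lives in object storage instead of memory: serialize, checksum, and verify the checksum before every restore.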

The most overlooked part of disaster recovery is testing. A backup that has never been restored is not a backup — it is a hope. A failover procedure that has never been executed is not a procedure — it is a guess. Schedule quarterly DR drills: simulate a primary database failure, verify automatic failover works, restore from a backup and verify data integrity, measure how long each step takes. The drill reveals gaps in your runbook, missing automation, and incorrect assumptions about recovery time. Fix these in calm conditions, not during an actual outage.
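
A drill is only useful if it produces numbers. One way to get them — a sketch, not this module's prescribed tooling — is a small harness that runs each drill step, records pass/fail, and times it, so "how long did recovery take?" has an answer you can compare quarter over quarter. The step names below are illustrative.

```typescript
// Minimal drill harness: run each recovery step, record whether it
// passed and how long it took.
interface DrillStep {
  name: string;
  run: () => Promise<boolean>; // true = step verified
}

interface DrillReport {
  name: string;
  passed: boolean;
  tookMs: number;
}

async function runDrill(steps: DrillStep[]): Promise<DrillReport[]> {
  const reports: DrillReport[] = [];
  for (const step of steps) {
    const start = Date.now();
    // A step that throws counts as failed, not as an aborted drill.
    const passed = await step.run().catch(() => false);
    reports.push({ name: step.name, passed, tookMs: Date.now() - start });
  }
  return reports;
}
```

In a real drill the steps would be "restore latest backup into a scratch database", "verify row counts and checksums", "fail over to the replica", each implemented against your actual infrastructure.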

For MCP servers specifically, DR planning must account for client reconnection behavior. When your server goes down and comes back (or fails over to a replica), every connected client loses its session. Clients will attempt to reconnect, list tools, and resume their conversation. Your server must handle this burst of simultaneous reconnections without falling over — a thundering herd at exactly the worst time. Rate-limit the reconnection flow, prioritize tools/list responses (they are cheap and unblock the client), and defer expensive operations until the reconnection storm passes.

// DR runbook as executable code
interface DRStep {
  name: string;
  check: () => Promise<boolean>;
  fix: () => Promise<void>;
  timeoutMs: number;
}

const runbook: DRStep[] = [
  {
    name: "Primary database connectivity",
    check: async () => {
      try {
        await pool.query("SELECT 1");
        return true;
      } catch { return false; }
    },
    fix: async () => {
      // Promote replica to primary
      console.error("Promoting read replica to primary...");
      await switchDatabaseEndpoint(process.env.REPLICA_URL!);
    },
    timeoutMs: 10_000,
  },
  {
    name: "Tool call success rate",
    check: async () => {
      const stats = getRecentToolStats(60); // last 60 seconds
      return stats.errorRate < 0.1; // < 10% errors
    },
    fix: async () => {
      // Circuit-break failing tools
      console.error("Circuit-breaking tools with > 50% error rate...");
      const stats = getPerToolStats(60);
      for (const [tool, s] of Object.entries(stats)) {
        if (s.errorRate > 0.5) disableTool(tool);
      }
    },
    timeoutMs: 5_000,
  },
  {
    name: "Reconnection storm protection",
    check: async () => {
      const rate = getConnectionRate(10); // last 10 seconds
      return rate < 100; // < 100 new connections/sec
    },
    fix: async () => {
      console.error("Enabling connection rate limiting...");
      setConnectionRateLimit(50); // max 50/sec, queue the rest
    },
    timeoutMs: 5_000,
  },
];

async function executeDRRunbook() {
  for (const step of runbook) {
    // A check that throws or hangs counts as failed — the catch prevents
    // a rejected check from aborting the whole runbook mid-run.
    const ok = await Promise.race([
      step.check().catch(() => false),
      new Promise<boolean>(r => setTimeout(() => r(false), step.timeoutMs)),
    ]);
    if (!ok) {
      console.error(`DR: "${step.name}" FAILED — executing fix...`);
      await step.fix();
    } else {
      console.error(`DR: "${step.name}" OK`);
    }
  }
}

1. **Define recovery objectives.** Set RTO (recovery time objective — how fast you must recover) and RPO (recovery point objective — how much data loss is acceptable). These numbers drive your architecture decisions.
2. **Automate the runbook.** Convert every recovery step into a check-and-fix function. Run the runbook automatically when the health check detects degradation, or manually when alerted.
3. **Drill quarterly.** Simulate a primary failure, execute the runbook, measure actual recovery time, and document gaps. Update the runbook based on what you learned. The drill is the test.
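
Tying the first and third points together: a drill should end by comparing measured numbers against the objectives you set. A minimal sketch, with illustrative field names and thresholds:

```typescript
// Compare measured drill results against committed RTO/RPO.
interface RecoveryObjectives {
  rtoSeconds: number; // max acceptable time from failure to healthy
  rpoSeconds: number; // max acceptable data-loss window
}

interface DrillMeasurement {
  recoverySeconds: number;      // measured wall-clock recovery time
  lastBackupAgeSeconds: number; // age of the backup that was restored
}

// Returns a list of gaps; an empty array means both objectives were met.
function evaluateDrill(obj: RecoveryObjectives, m: DrillMeasurement): string[] {
  const gaps: string[] = [];
  if (m.recoverySeconds > obj.rtoSeconds) {
    gaps.push(`RTO missed: recovered in ${m.recoverySeconds}s, objective ${obj.rtoSeconds}s`);
  }
  if (m.lastBackupAgeSeconds > obj.rpoSeconds) {
    gaps.push(`RPO missed: backup was ${m.lastBackupAgeSeconds}s old, objective ${obj.rpoSeconds}s`);
  }
  return gaps;
}
```

Each gap the evaluation reports is a concrete item for the post-drill fix list, addressed in calm conditions rather than during an outage.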