MP-301i · Module 2

Runbooks for MCP Operations

4 min read

A runbook is a documented procedure for diagnosing and resolving a specific operational issue. Every alert in your monitoring system should link to a runbook. The runbook answers four questions: What is happening? (symptom description and severity), How do I confirm? (diagnostic commands to verify the issue), How do I fix it? (step-by-step remediation), and How do I verify the fix? (commands to confirm the system is healthy). A runbook transforms a 3 AM page from "figure it out under pressure" to "follow these steps." The quality of your runbooks determines whether your on-call engineers fix issues in 5 minutes or 50.

MCP-specific runbooks should cover the failure modes unique to MCP infrastructure. Session storms: too many clients reconnecting simultaneously after a server restart, overwhelming the initialization handler. Token refresh cascades: a batch of expired tokens triggering concurrent refresh requests that rate-limit the authorization server. Tool timeout spirals: a slow downstream dependency causing tool timeouts, which cause client retries, which amplify the load on the already-struggling dependency. Each of these has a specific diagnostic signature and remediation — and each is different from standard web application failures.

Runbook maintenance is as important as runbook creation. A runbook written for version 1.0 may be wrong for version 2.0. After every incident, update the runbook with what actually happened and what actually fixed it. If the runbook was wrong — the commands did not work, the steps were in the wrong order, a step was missing — fix it immediately. Date-stamp every update. Review all runbooks quarterly. A stale runbook is worse than no runbook because it gives false confidence during an incident.

# Runbook: MCP Session Storm

## Symptom
- CPU > 90% on MCP server instances
- initialize request latency > 10s (normal: < 200ms)
- Connection queue growing, new clients timing out

## Diagnosis
```bash
# Check active session count
curl -s http://mcp-server:9090/metrics | grep mcp_active_sessions

# Check initialization rate (should be < 100/sec normally)
curl -s http://mcp-server:9090/metrics | grep mcp_initialize_total

# Check if a recent restart triggered reconnections
journalctl -u mcp-server --since "10 minutes ago" | grep "started\|stopped"
```

## Remediation
1. **Rate-limit initializations**: Set MAX_INIT_PER_SECOND=50 via env var
2. **Scale horizontally**: Add 2 instances via autoscaler override
3. **Stagger client reconnections**: Push config update to clients
   to add random 0-30s jitter on reconnection

## Verify
```bash
# CPU should drop below 60% within 5 minutes
# Initialize latency should return to < 200ms
curl -s http://mcp-server:9090/metrics | grep mcp_initialize_duration
```