MP-301i · Module 2

On-Call & Escalation Patterns


On-call for MCP infrastructure follows the same principles as any production system, with additions specific to MCP such as session and token handling. The first responder needs access to four things: the MCP server logs (structured, queryable), the metrics dashboard (session counts, tool latency, error rates), the deployment pipeline (to roll back if needed), and the runbook library. They also need the ability to restart MCP server instances, scale the fleet, and invalidate session tokens. Grant these permissions to the on-call role, not to individual engineers, and remove each engineer from the role when their rotation ends.
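One way to picture role-scoped access is as a grant that is only valid during a rotation window. The sketch below is illustrative only: `RotationGrant`, `ON_CALL_ACTIONS`, and the action names are hypothetical and not part of any MCP SDK or access-control product. In practice this check would live in whatever IAM system is already in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical action names, mirroring the capabilities listed above.
ON_CALL_ACTIONS = frozenset({
    "restart_instance",
    "scale_fleet",
    "invalidate_session_tokens",
    "rollback_deployment",
})


@dataclass(frozen=True)
class RotationGrant:
    """Binds the on-call role to one engineer for one rotation window."""
    engineer: str
    starts_at: datetime
    ends_at: datetime

    def allows(self, engineer: str, action: str, now: datetime) -> bool:
        # Access comes from holding the role inside the window,
        # never from the engineer's identity alone.
        in_window = self.starts_at <= now < self.ends_at
        return (
            engineer == self.engineer
            and in_window
            and action in ON_CALL_ACTIONS
        )


# Example rotation: one week starting on an arbitrary date.
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
grant = RotationGrant("alice", start, start + timedelta(days=7))
```

Because the grant carries its own end time, access lapses at handoff without anyone having to remember to revoke it.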

Escalation tiers for MCP incidents follow severity. Tier 1 (first responder) handles known issues with existing runbooks: restart a crashed instance, scale up during a traffic spike, clear a stuck session. Tier 2 (senior engineer) handles unknown issues that do not match any runbook: new failure modes, complex cascading failures, data inconsistencies. Tier 3 (architect/principal) handles systemic issues that require architectural changes: fundamental scaling limits, security breaches, protocol-level bugs. Escalation should follow explicit criteria, not gut feel: if Tier 1 has not resolved the incident within 15 minutes, or the issue matches no runbook, escalate to Tier 2.
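The tier rules can be written down as a small decision function, which makes them easy to review and test. This is a minimal sketch, not a prescribed implementation: the names `Tier` and `next_tier` are hypothetical, the 15-minute limit comes from the paragraph above, and the 30-minute Tier 2 limit and the code-change trigger come from the escalation criteria listed in this module.

```python
from enum import IntEnum


class Tier(IntEnum):
    T1 = 1  # first responder
    T2 = 2  # senior engineer
    T3 = 3  # architect/principal


# Time-in-tier limits in minutes, as stated in this module.
T1_LIMIT_MIN = 15
T2_LIMIT_MIN = 30


def next_tier(current: Tier, minutes_in_tier: float,
              matches_runbook: bool, needs_code_change: bool = False) -> Tier:
    """Return the tier that should own the incident right now."""
    if current == Tier.T1:
        # T1 -> T2: time limit reached, or no runbook covers the issue.
        if minutes_in_tier >= T1_LIMIT_MIN or not matches_runbook:
            return Tier.T2
    elif current == Tier.T2:
        # T2 -> T3: time limit reached, or the fix needs a code change.
        if minutes_in_tier >= T2_LIMIT_MIN or needs_code_change:
            return Tier.T3
    # Otherwise the incident stays where it is (T3 is the top tier).
    return current
```

For example, `next_tier(Tier.T1, 2, matches_runbook=False)` escalates immediately, because an issue with no runbook is Tier 2 work regardless of elapsed time.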

Build the on-call toolkit: Create a single dashboard that shows active sessions, tool invocation rate, error rate, p99 latency, deployment status, and recent alerts. The first responder should not need to check multiple systems.
Define escalation criteria: Document specific triggers for each escalation tier. T1→T2: 15 minutes unresolved, or the issue does not match any runbook. T2→T3: 30 minutes unresolved, or the fix requires a code change.
Run incident simulations: Once a month, inject a failure (kill an instance, expire all tokens, throttle the database) and have the on-call team respond as if it were real. Review the response afterward and update the runbooks.
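The monthly simulation can be kept honest with a small amount of bookkeeping. The sketch below rotates through the three example failures and turns a drill record into review findings. Everything here is an assumption for illustration: `SCENARIOS`, `DrillResult`, `scenario_for_month`, and `review` are hypothetical names, and the actual failure injection is left to whatever tooling the team already uses.

```python
from dataclasses import dataclass

# The three example failures named in the step above.
SCENARIOS = ("kill_instance", "expire_all_tokens", "throttle_database")

# Tier 1 time limit in minutes, from the escalation criteria.
T1_LIMIT_MIN = 15


@dataclass
class DrillResult:
    scenario: str
    minutes_to_resolve: float
    runbook_existed: bool
    escalated: bool


def scenario_for_month(month: int) -> str:
    """Rotate through the scenarios so each is exercised once a quarter."""
    return SCENARIOS[(month - 1) % len(SCENARIOS)]


def review(result: DrillResult) -> list[str]:
    """Turn a drill record into follow-up items for the runbook library."""
    findings = []
    if not result.runbook_existed:
        findings.append(f"write a runbook for {result.scenario}")
    if result.minutes_to_resolve > T1_LIMIT_MIN and not result.escalated:
        findings.append("Tier 1 exceeded the time limit without escalating")
    return findings


# Example: a February drill that ran long with no runbook and no escalation.
drill = DrillResult(scenario_for_month(2), 22.0,
                    runbook_existed=False, escalated=False)
findings = review(drill)
```

A deterministic rotation is used instead of random selection so that no scenario goes untested for long. A team that wants surprise drills could randomize the day rather than the scenario.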