MP-301c · Module 3
Failover & Health Checks
A health check for an MCP server is a lightweight endpoint or probe that answers one question: can this server process tool calls right now? For HTTP-transport servers, this is typically a GET /health endpoint that returns 200 when the server can serve tool calls and 503 when it cannot. For stdio servers behind a process manager, the health check is a periodic heartbeat: send a cheap request (such as tools/list) and verify that a response arrives within the timeout. If the health check fails, the infrastructure layer (load balancer, process manager, or container orchestrator) routes traffic to a healthy instance or restarts the failed one.
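The stdio heartbeat described above can be sketched as a single function. This is a minimal illustration, not a definitive implementation: `SendRequest` and `heartbeat` are hypothetical names, and the sketch assumes the process manager already has some way to write a JSON-RPC request to the server's stdin and read the matching response from its stdout.

```typescript
// Hypothetical transport hook (an assumption, not part of any SDK): writes one
// JSON-RPC request to the server's stdin and resolves with the matching response.
type SendRequest = (request: object) => Promise<unknown>;

const TIMED_OUT = Symbol("timed out");

// Resolves true if the server answers tools/list within timeoutMs,
// false if the request fails or the deadline passes first.
async function heartbeat(send: SendRequest, timeoutMs: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), timeoutMs);
  });
  try {
    const probe = send({ jsonrpc: "2.0", id: "heartbeat", method: "tools/list" });
    const outcome = await Promise.race([probe, deadline]);
    return outcome !== TIMED_OUT;
  } catch {
    // A broken pipe or transport error counts as a failed heartbeat.
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

A process manager would call `heartbeat` on an interval and restart the child process when it returns false. Note that the sketch treats any response as a sign of life, including a JSON-RPC error response; a stricter probe could inspect the payload.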
A useful health check tests more than "is the process alive." It validates the server's dependencies: can it connect to the database? Can it reach external APIs? Is the event loop responsive? A server that is technically running but cannot reach its database will accept tool calls and fail every one of them. A deep health check that verifies dependency connectivity catches this state and marks the server unhealthy, so traffic is routed away before tool calls start failing. The trade-off is that deep health checks take longer and can themselves become a bottleneck, so keep them under 500 ms and cache the result for 5 to 10 seconds.
// Deep health check for an HTTP-transport MCP server
import type { Request, Response } from "express";
import { Pool } from "pg";

// Assumption: the database is PostgreSQL accessed through node-postgres, which
// reads its connection settings from the PG* environment variables.
// Substitute your own client if you use a different database.
const pool = new Pool();

interface HealthStatus {
  status: "healthy" | "degraded" | "unhealthy";
  uptime: number; // seconds since process start
  checks: Record<string, { ok: boolean; latencyMs: number; error?: string }>;
}

// Checks the server cannot serve tool calls without; a failure here means 503.
const CRITICAL_CHECKS = ["database"];

const startTime = Date.now();
let cachedHealth: HealthStatus | null = null;
let cacheExpiry = 0;

export async function healthHandler(_req: Request, res: Response) {
  const now = Date.now();

  // Serve the cached result while it is still fresh
  if (cachedHealth && now < cacheExpiry) {
    res.status(cachedHealth.status === "unhealthy" ? 503 : 200).json(cachedHealth);
    return;
  }

  const checks: HealthStatus["checks"] = {};

  // Database connectivity
  const dbStart = performance.now();
  try {
    await pool.query("SELECT 1");
    checks.database = { ok: true, latencyMs: Math.round(performance.now() - dbStart) };
  } catch (err) {
    checks.database = { ok: false, latencyMs: Math.round(performance.now() - dbStart), error: (err as Error).message };
  }

  // Event loop responsiveness: time how long a setImmediate callback waits to run
  const loopStart = performance.now();
  await new Promise(r => setImmediate(r));
  const loopLag = Math.round(performance.now() - loopStart);
  checks.eventLoop = { ok: loopLag < 100, latencyMs: loopLag };

  // Determine overall status: a failed critical check is unhealthy,
  // any other failed check is degraded
  const criticalOk = CRITICAL_CHECKS.every(name => checks[name]?.ok === true);
  const allOk = Object.values(checks).every(c => c.ok);
  const status: HealthStatus = {
    status: !criticalOk ? "unhealthy" : allOk ? "healthy" : "degraded",
    uptime: Math.round((now - startTime) / 1000),
    checks,
  };

  cachedHealth = status;
  cacheExpiry = now + 5000; // Cache for 5 seconds
  res.status(status.status === "unhealthy" ? 503 : 200).json(status);
}
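The handler above awaits its database query with no deadline, so a hung connection could stall the probe well past the 500 ms budget. One way to enforce the budget, sketched here under the assumption that each dependency check can be expressed as a function returning a promise, is to race every check against a timer and run the checks concurrently. The names `runCheck` and `runChecks` are illustrative and not part of any library.

```typescript
interface CheckResult {
  ok: boolean;
  latencyMs: number;
  error?: string;
}

// Runs one dependency probe and reports it as failed if it exceeds budgetMs.
async function runCheck(probe: () => Promise<unknown>, budgetMs: number): Promise<CheckResult> {
  const start = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${budgetMs}ms`)), budgetMs);
  });
  try {
    await Promise.race([probe(), deadline]);
    return { ok: true, latencyMs: Date.now() - start };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, latencyMs: Date.now() - start, error: message };
  } finally {
    clearTimeout(timer);
  }
}

// Runs all probes concurrently, so the total time is bounded by the budget
// rather than by the sum of the individual checks.
async function runChecks(
  probes: Record<string, () => Promise<unknown>>,
  budgetMs: number,
): Promise<Record<string, CheckResult>> {
  const results: Record<string, CheckResult> = {};
  await Promise.all(
    Object.keys(probes).map(async (name) => {
      results[name] = await runCheck(probes[name], budgetMs);
    }),
  );
  return results;
}
```

Racing against a timer only stops the health check from waiting; it does not cancel the underlying query or request, so a client-level timeout on the dependency itself is still worth configuring.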
- Implement a /health endpoint. Return a JSON object with overall status, uptime, and per-dependency check results. Return 503 only when the server cannot process any tool calls.
- Test dependencies in the health check. Verify database connectivity, external API reachability, and event loop responsiveness. Each check gets a pass/fail result and a latency measurement.
- Cache the health result. Recompute health every 5 to 10 seconds, not on every probe. This prevents the health check itself from becoming a performance bottleneck under frequent probing.
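A time-based cache still lets several probes recompute health at once if they all arrive just after the cached value expires. The sketch below shows one possible way to cache the result and share a single in-flight computation among concurrent callers. `cachedProbe` is a hypothetical helper, and the injectable `now` clock is an assumption added to make the behavior easy to test.

```typescript
// Wraps an expensive health computation so it runs at most once per ttlMs.
// Probes that arrive while a refresh is running share that refresh.
function cachedProbe<T>(
  compute: () => Promise<T>,
  ttlMs: number,
  now: () => number = Date.now,
): () => Promise<T> {
  let value: T | undefined;
  let hasValue = false;
  let expiresAt = 0;
  let inFlight: Promise<T> | null = null;

  return async function get(): Promise<T> {
    // Fresh cached value: return it without touching any dependency.
    if (hasValue && now() < expiresAt) return value as T;
    // A refresh is already running: wait for it instead of starting another.
    if (inFlight) return inFlight;

    inFlight = compute()
      .then((result) => {
        value = result;
        hasValue = true;
        expiresAt = now() + ttlMs;
        return result;
      })
      .finally(() => {
        inFlight = null;
      });
    return inFlight;
  };
}
```

For example, `const getHealth = cachedProbe(computeHealth, 5000)` (where `computeHealth` stands for whatever function runs the dependency checks) could back a /health handler that simply awaits `getHealth()` on every probe. If the computation rejects, nothing is cached and the next probe retries.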