MP-301c · Module 3

Failover & Health Checks

4 min read

A health check for an MCP server is a lightweight endpoint or probe that answers one question: can this server process tool calls right now? For HTTP-transport servers, this is typically a GET /health endpoint that returns 200 while the server can serve tool calls and 503 when it cannot. For stdio servers behind a process manager, the health check is a periodic heartbeat: send a cheap protocol request (such as tools/list) and verify that a response arrives within the timeout. If the health check fails, the infrastructure layer (load balancer, process manager, or container orchestrator) routes traffic to a healthy instance or restarts the failed one.
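The stdio heartbeat described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `SendFn` is a hypothetical hook standing in for whatever code writes a JSON-RPC request to the child process and resolves with its response, and the 2000 ms default timeout is an assumed value.

```typescript
// JSON-RPC 2.0 request shape used by MCP.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
}

// Hypothetical transport hook (an assumption of this sketch): writes the request
// to the server's stdin and resolves with the matching response from its stdout.
type SendFn = (req: JsonRpcRequest) => Promise<unknown>;

let nextId = 1;

// Resolves to true if the server answers within timeoutMs, false otherwise.
// "tools/list" follows the text; MCP also defines a cheaper "ping" request.
export async function heartbeat(
  send: SendFn,
  timeoutMs = 2000,
  method = "tools/list",
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("heartbeat timed out")), timeoutMs);
  });
  try {
    await Promise.race([send({ jsonrpc: "2.0", id: nextId++, method }), timeout]);
    return true;
  } catch {
    // A timeout, a broken pipe, or any transport error counts as a failed beat.
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

A process manager would call `heartbeat` on an interval and restart the child when it returns false; whether to restart on the first failure or only after several consecutive ones is a policy choice the sketch leaves open.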

A useful health check tests more than "is the process alive." It validates the server's dependencies: can it connect to the database? Can it reach external APIs? Is the event loop responsive? A server that is technically running but cannot reach its database will accept tool calls and fail every one of them. A deep health check that verifies dependency connectivity catches this state and marks the server unhealthy before it starts failing tool calls. The trade-off is that deep health checks take longer and can themselves become a bottleneck — keep them under 500ms and cache the result for 5-10 seconds.
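One way to hold deep checks to the 500 ms budget is to race each dependency probe against a timer and run the probes in parallel. The sketch below assumes each dependency is exposed as a function returning a promise; the names `timedCheck` and `runChecks` are illustrative, not part of any library.

```typescript
interface CheckResult {
  ok: boolean;
  latencyMs: number;
  error?: string;
}

// Runs one probe and fails it if it does not settle within budgetMs.
export async function timedCheck(
  probe: () => Promise<unknown>,
  budgetMs = 500,
): Promise<CheckResult> {
  const start = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`exceeded ${budgetMs}ms budget`)), budgetMs);
  });
  try {
    await Promise.race([probe(), timeout]);
    return { ok: true, latencyMs: Date.now() - start };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, latencyMs: Date.now() - start, error: message };
  } finally {
    clearTimeout(timer);
  }
}

// Runs all probes concurrently, so total time is bounded by the slowest probe
// (at most budgetMs) rather than by the sum of all probes.
export async function runChecks(
  probes: Record<string, () => Promise<unknown>>,
  budgetMs = 500,
): Promise<Record<string, CheckResult>> {
  const pairs = await Promise.all(
    Object.entries(probes).map(
      async ([name, probe]) => [name, await timedCheck(probe, budgetMs)] as const,
    ),
  );
  const results: Record<string, CheckResult> = {};
  for (const [name, result] of pairs) results[name] = result;
  return results;
}
```

Note that the race only stops waiting; it does not cancel the underlying query or request. If the client library supports its own timeout or cancellation, setting that as well avoids piling up abandoned work.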

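// Deep health check for HTTP-transport MCP server
import type { Request, Response } from "express";
// Assumption: `pool` is the application's database connection pool, exported
// from a local module. Adjust the import path to match your project.
import { pool } from "./db";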

interface HealthStatus {
  status: "healthy" | "degraded" | "unhealthy";
  uptime: number;
  checks: Record<string, { ok: boolean; latencyMs: number; error?: string }>;
}

const startTime = Date.now();
let cachedHealth: HealthStatus | null = null;
let cacheExpiry = 0;

export async function healthHandler(_req: Request, res: Response) {
  const now = Date.now();
  if (cachedHealth && now < cacheExpiry) {
    res.status(cachedHealth.status === "unhealthy" ? 503 : 200).json(cachedHealth);
    return;
  }

  const checks: HealthStatus["checks"] = {};

  // Database connectivity
  const dbStart = performance.now();
  try {
    await pool.query("SELECT 1");
    checks.database = { ok: true, latencyMs: Math.round(performance.now() - dbStart) };
  } catch (err) {
    checks.database = { ok: false, latencyMs: Math.round(performance.now() - dbStart), error: (err as Error).message };
  }

  // Event loop responsiveness
  const loopStart = performance.now();
  await new Promise(r => setImmediate(r));
  const loopLag = Math.round(performance.now() - loopStart);
  checks.eventLoop = { ok: loopLag < 100, latencyMs: loopLag };

  // Determine overall status. The database is treated as critical: without it
  // every tool call fails, so a database failure must yield "unhealthy" (503)
  // even if the event loop check passes.
  const allOk = Object.values(checks).every(c => c.ok);
  const status: HealthStatus = {
    status: allOk ? "healthy" : checks.database.ok ? "degraded" : "unhealthy",
    uptime: Math.round((now - startTime) / 1000),
    checks,
  };

  cachedHealth = status;
  cacheExpiry = now + 5000; // Cache for 5 seconds

  res.status(status.status === "unhealthy" ? 503 : 200).json(status);
}
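The handler above only reports health; something still has to act on the answer. In production that is normally the load balancer or orchestrator, but the routing decision itself can be sketched in a few lines. The `Probe` type here is a hypothetical hook that resolves to the HTTP status of GET /health for a given instance, and the instance URLs are placeholders.

```typescript
// Hypothetical probe (an assumption of this sketch): resolves to the HTTP
// status code returned by GET <baseUrl>/health.
type Probe = (baseUrl: string) => Promise<number>;

// Returns the first instance whose health endpoint answers 200, or null if
// none does. Instances are tried in order, so list the preferred one first.
export async function firstHealthy(
  instances: string[],
  probe: Probe,
): Promise<string | null> {
  for (const baseUrl of instances) {
    try {
      if ((await probe(baseUrl)) === 200) return baseUrl;
    } catch {
      // Connection refused, DNS failure, and similar errors mean the instance
      // is unreachable: treat it as unhealthy and try the next one.
    }
  }
  return null;
}
```

With the global `fetch` available in recent Node.js versions, a probe could be as small as `async (u) => (await fetch(u + "/health")).status`, ideally combined with a request timeout so a hung instance does not stall failover.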

Implement a /health endpoint: Return a JSON object with overall status, uptime, and per-dependency check results. Return 503 only when the server cannot process any tool calls.
Test dependencies in the health check: Verify database connectivity, external API reachability, and event loop responsiveness. Each check gets a pass/fail result and a latency measurement.
Cache the health result: Recompute health every 5-10 seconds, not on every probe. This prevents the health check itself from becoming a performance bottleneck under frequent probing.
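A time-based cache alone still lets every probe that arrives just after expiry trigger its own recomputation. One possible refinement, sketched below under the assumption that the health computation is an async function, is to share a single in-flight computation among concurrent callers. `cachedCheck` is an illustrative name, not an existing API.

```typescript
// Wraps an async computation so that results are reused for ttlMs and
// concurrent callers share one in-flight computation instead of each
// starting their own.
export function cachedCheck<T>(
  compute: () => Promise<T>,
  ttlMs = 5000,
): () => Promise<T> {
  let entry: { value: T; expiresAt: number } | null = null;
  let inFlight: Promise<T> | null = null;

  return () => {
    // Fresh result available: serve it without touching any dependency.
    if (entry && Date.now() < entry.expiresAt) return Promise.resolve(entry.value);
    // A recomputation is already running: join it.
    if (inFlight) return inFlight;

    const pending = compute().then(value => {
      entry = { value, expiresAt: Date.now() + ttlMs };
      return value;
    });
    // Clear the in-flight marker whether the computation succeeds or fails,
    // so a failure is retried on the next probe rather than cached.
    const clear = () => {
      if (inFlight === pending) inFlight = null;
    };
    pending.then(clear, clear);
    inFlight = pending;
    return pending;
  };
}
```

A handler would create the wrapper once at startup, for example `const getHealth = cachedCheck(computeHealth, 5000)` with `computeHealth` being whatever function runs the dependency checks, and then await `getHealth()` on each probe.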