MP-201c · Module 3

Monitoring & Observability

4 min read

Health checks for MCP servers must go beyond "is the process running." A meaningful health check verifies that the server can accept connections, parse JSON-RPC, and reach its downstream dependencies. Implement a /health endpoint (for HTTP transport) or a periodic self-test (for stdio) that exercises the full request path. The health check should return degraded status when dependencies are slow, not just when they are completely down. A server that takes 30 seconds to respond to tool calls is effectively down for interactive use.

The four golden signals for MCP servers are latency (time from request to response), traffic (requests per second, broken down by tool), errors (failed tool invocations, malformed requests, auth failures), and saturation (CPU usage, memory usage, connection pool utilization). Track these per tool — a single slow tool should not hide behind healthy aggregate metrics. Set alerts on percentiles, not averages: p99 latency matters more than mean latency because it tells you how your worst-case users experience the server.
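Per-tool percentile tracking can be sketched as follows. This is a minimal in-memory recorder, not a production metrics library; the class and method names (`ToolMetrics`, `recordToolCall`) are illustrative, and the percentile uses the simple nearest-rank method over raw samples — a real deployment would typically export histograms to a system like Prometheus instead.

```typescript
// Minimal per-tool metrics recorder: latency samples and error counts,
// keyed by tool name so one slow tool cannot hide in the aggregate.
class ToolMetrics {
  private latencies = new Map<string, number[]>();
  private errors = new Map<string, number>();

  // Call this from the tool dispatch path after each invocation.
  recordToolCall(tool: string, latencyMs: number, ok: boolean): void {
    const samples = this.latencies.get(tool) ?? [];
    samples.push(latencyMs);
    this.latencies.set(tool, samples);
    if (!ok) this.errors.set(tool, (this.errors.get(tool) ?? 0) + 1);
  }

  errorCount(tool: string): number {
    return this.errors.get(tool) ?? 0;
  }

  // Nearest-rank percentile (p in 0..100) over this tool's samples.
  percentile(tool: string, p: number): number {
    const samples = [...(this.latencies.get(tool) ?? [])].sort((a, b) => a - b);
    if (samples.length === 0) return 0;
    const rank = Math.ceil((p / 100) * samples.length) - 1;
    return samples[Math.min(Math.max(rank, 0), samples.length - 1)];
  }
}
```

An alerting rule would then fire on `percentile(tool, 99)` exceeding a threshold per tool, rather than on the mean across all tools.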

Distributed tracing connects the dots across the full request lifecycle: client sends request, load balancer routes it, MCP server receives it, tool handler executes, downstream API is called, response flows back. Propagate trace IDs through the Mcp-Session-Id header or a custom trace header so you can reconstruct the complete journey of any request. When a user reports "the tool was slow," the trace ID lets you pinpoint exactly where the time was spent — was it the MCP server, the downstream API, or the network in between?
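Trace-ID propagation can be sketched like this. The `X-Trace-Id` header name and the helper functions are assumptions for illustration — reuse whatever the client already sent, mint a fresh ID otherwise, and forward it on every downstream call so all hops log under the same ID.

```typescript
import { randomUUID } from "node:crypto";

// Reuse an incoming trace ID if the client supplied one, otherwise mint a
// new one. Header lookup is lowercase, matching Node's normalized headers.
function getOrCreateTraceId(headers: Record<string, string>): string {
  return headers["x-trace-id"] ?? randomUUID();
}

// Forward the trace ID to a downstream API so its logs and spans can be
// joined with the MCP server's own logs for the same request.
async function callDownstream(url: string, traceId: string): Promise<Response> {
  return fetch(url, { headers: { "X-Trace-Id": traceId } });
}
```

Logging the trace ID alongside each tool handler's start and end timestamps is what makes the "where was the time spent" question answerable after the fact.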

// Structured health check for MCP servers
interface HealthStatus {
  status: "healthy" | "degraded" | "unhealthy";
  timestamp: string;
  checks: Record<string, {
    status: "pass" | "warn" | "fail";
    latency_ms: number;
    message?: string;
  }>;
}

// `db` is assumed to be an existing database client with a `query` method.
declare const db: { query(sql: string): Promise<unknown> };

async function healthCheck(): Promise<HealthStatus> {
  const checks: HealthStatus["checks"] = {};

  // Check JSON-RPC parsing (parsing a fixed string is a smoke test; a
  // fuller check would round-trip a request through the server's own parser)
  const rpcStart = Date.now();
  try {
    JSON.parse('{"jsonrpc":"2.0","method":"ping","id":1}');
    checks.jsonrpc = { status: "pass", latency_ms: Date.now() - rpcStart };
  } catch {
    checks.jsonrpc = { status: "fail", latency_ms: Date.now() - rpcStart };
  }

  // Check downstream database
  const dbStart = Date.now();
  try {
    await db.query("SELECT 1");
    const ms = Date.now() - dbStart;
    checks.database = {
      status: ms > 1000 ? "warn" : "pass",
      latency_ms: ms,
      message: ms > 1000 ? "Database responding slowly" : undefined,
    };
  } catch (err) {
    checks.database = {
      status: "fail",
      latency_ms: Date.now() - dbStart,
      message: String(err),
    };
  }

  const overall = Object.values(checks).some(c => c.status === "fail")
    ? "unhealthy"
    : Object.values(checks).some(c => c.status === "warn")
      ? "degraded"
      : "healthy";

  return { status: overall, timestamp: new Date().toISOString(), checks };
}