MP-201c · Module 3
Load Balancing & Scaling
4 min read
Scaling MCP servers horizontally is straightforward when the server is stateless — every request is self-contained, any instance can handle any request, and a standard load balancer (round-robin, least-connections) distributes traffic evenly. Stateless design means no in-memory session state, no local file caches that other instances need, and no assumption that consecutive requests hit the same instance. If your MCP server meets these criteria, scaling is just adding more instances behind the load balancer.
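The two balancing strategies mentioned above can be sketched in a few lines. This is a toy in-process model, not a real load balancer; the class and method names are illustrative:

```python
import itertools

class RoundRobinBalancer:
    """Cycles through backend instances in fixed order."""
    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Routes each request to the instance with the fewest active connections."""
    def __init__(self, instances):
        self._active = {inst: 0 for inst in instances}

    def pick(self):
        inst = min(self._active, key=self._active.get)
        self._active[inst] += 1
        return inst

    def release(self, inst):
        # Called when a request completes, freeing capacity on that instance.
        self._active[inst] -= 1
```

Either strategy works only because any instance can serve any request; the moment requests must land on a specific instance, you are in the sticky-session territory covered next.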
Stateful MCP servers — those that maintain session context, conversation history, or tool state between requests — introduce the sticky session problem. The load balancer must route all requests with the same Mcp-Session-Id to the same backend instance. This works until that instance fails, at which point the session is lost. The better approach is externalized state: store session data in Redis or a database, so any instance can resume any session. This converts a stateful server into a stateless one from the load balancer's perspective.
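A minimal sketch of the externalized-state pattern, with a plain dict standing in for Redis (in production you would swap the backend for a real client and add a TTL); the function and key names are assumptions for illustration:

```python
import json

class SessionStore:
    """Externalized session state keyed by Mcp-Session-Id.
    A dict stands in for Redis here so the sketch is self-contained."""
    def __init__(self, backend=None):
        self._backend = backend if backend is not None else {}

    def save(self, session_id, state):
        # Serialize so the stored value carries no in-process references.
        self._backend[session_id] = json.dumps(state)

    def load(self, session_id):
        raw = self._backend.get(session_id)
        return json.loads(raw) if raw is not None else None

def handle_request(store, session_id, message):
    """Any instance can run this: state comes from the store, not local memory."""
    state = store.load(session_id) or {"history": []}
    state["history"].append(message)
    store.save(session_id, state)
    return state
```

Because `handle_request` reads and writes the shared store on every call, a request rerouted to a different instance picks up exactly where the previous one left off, which is what makes the server stateless from the load balancer's perspective.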
Connection pooling matters for MCP servers that maintain persistent connections to downstream resources (databases, APIs, other MCP servers). Each instance needs its own connection pool, and the total connections across all instances must not exceed the downstream resource's limits. A common failure mode: auto-scaling adds 10 new instances, each opens 20 database connections, and the database hits its 200-connection limit. Set per-instance pool sizes based on the maximum instance count, not the current count.
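The sizing rule above reduces to simple arithmetic. A sketch, where the 10% headroom for admin sessions and migrations is an assumption, not a fixed rule:

```python
def per_instance_pool_size(downstream_limit, max_instances, headroom=0.9):
    """Size each instance's connection pool so the fleet stays under the
    downstream limit even at maximum scale-out. `headroom` reserves a
    fraction of the limit for admin connections (an assumed default)."""
    return max(1, int(downstream_limit * headroom) // max_instances)
```

With the numbers from the failure mode above (a 200-connection database and a maximum of 10 instances), each instance gets a pool of 18, not 20, and the fleet never exhausts the database even at full scale-out.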
- **Audit for state.** Search your MCP server for in-memory maps, global variables, and local file writes that persist between requests. Each one is a scaling obstacle. Move them to Redis or a database.
- **Configure connection limits.** Calculate: (max instances) × (connections per instance) must stay under the downstream connection limit. Set pool sizes accordingly and add connection-timeout handling.
- **Test with chaos.** Run multiple instances behind a load balancer. Kill one instance mid-session. Verify the client reconnects and the session resumes on a different instance without data loss.
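The chaos drill can be rehearsed in-process before you run it against real infrastructure. A toy version, assuming the externalized-state design (a shared dict stands in for Redis; instance names are illustrative):

```python
def chaos_session_test():
    """Two 'instances' share one external store. The instance serving a
    session dies mid-way; verify another instance resumes the session
    without losing history."""
    shared_store = {}  # stands in for Redis or a database

    def serve(instance_id, session_id, message):
        state = shared_store.get(session_id, {"history": []})
        state["history"].append((instance_id, message))
        shared_store[session_id] = state
        return state

    serve("instance-a", "s1", "first request")
    # instance-a "dies"; the load balancer reroutes the session to instance-b
    state = serve("instance-b", "s1", "second request")
    assert [msg for _, msg in state["history"]] == ["first request", "second request"]
    return state
```

If your real server passes the equivalent test only when both requests hit the same instance, you still have sticky state hiding somewhere.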