MP-301a · Module 2

Rate Limiting & Retry Wrappers

4 min read

Rate limiting in MCP servers protects two things: downstream APIs from being overwhelmed by rapid-fire tool calls, and your server from abusive or misconfigured clients. The LLM does not inherently respect rate limits — if a task requires 50 customer lookups, it will fire all 50 as fast as the protocol allows. Your rate-limiting middleware acts as a governor, queueing excess requests and returning throttle errors when the queue is full. Unlike HTTP 429 responses, MCP rate-limit errors go back to the LLM as tool errors, so your error message should include the retry-after delay.
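A queueing governor can be sketched as follows. This is an illustrative sketch, not MCP SDK code: `QueuedLimiter`, `maxConcurrent`, and `maxQueued` are hypothetical names chosen for this example.

```typescript
// Minimal bounded-queue governor (illustrative sketch): admits up to
// maxConcurrent calls at once, parks up to maxQueued more, and rejects
// the rest with a throttle error that can be surfaced as a tool error.
class QueuedLimiter {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(
    private maxConcurrent: number,
    private maxQueued: number,
  ) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      if (this.waiters.length >= this.maxQueued) {
        // Queue full: reject immediately so the caller can return a
        // throttle error to the LLM with a retry-after hint.
        throw new Error("Throttled: concurrency limit and queue are full. Retry shortly.");
      }
      // Park the caller until a finishing call hands over its slot.
      await new Promise<void>(resolve => this.waiters.push(resolve));
    } else {
      this.active += 1;
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) next(); // hand our slot directly to the next queued caller
      else this.active -= 1;
    }
  }
}
```

Note the slot handoff in the finally block: the finishing call passes its concurrency slot directly to the next queued caller, which avoids a race where a newly arriving call steals the slot before the parked caller wakes up.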

Retry middleware wraps external API calls with exponential backoff and jitter. The pattern: catch transient errors (network timeouts, 5xx responses, rate-limit responses), wait an increasing delay with random jitter, and retry up to a configurable maximum. The jitter prevents thundering herds when multiple fan-out sub-operations hit the same rate limit simultaneously. Without jitter, all retries fire at the same time and hit the same limit again.

The interaction between rate limiting and retry is subtle. If your rate limiter rejects a request and your retry middleware catches the rejection, you get an infinite loop: reject → retry → reject → retry. The solution is to classify rate-limit rejections as non-retryable at the middleware boundary. Retry only wraps errors from the downstream service, not from your own rate limiter. This requires error taxonomy — different error types for "downstream failed" vs. "rate limited locally."
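The taxonomy can be sketched with distinct error classes. This is an assumed shape, not MCP SDK code: `LocalRateLimitError`, `DownstreamError`, and the status-code policy below are illustrative choices (here a downstream 429 is treated as retryable, while other 4xx responses are not).

```typescript
// Illustrative error taxonomy: distinguish downstream failures from
// local rate-limit rejections so retry never loops on our own limiter.
class LocalRateLimitError extends Error {}

class DownstreamError extends Error {
  constructor(message: string, public status?: number) {
    super(message);
  }
}

function isRetryable(err: unknown): boolean {
  // Never retry our own limiter: that is the reject -> retry loop.
  if (err instanceof LocalRateLimitError) return false;
  if (err instanceof DownstreamError) {
    // No status code: treat as a network-level failure (timeout, reset).
    if (err.status === undefined) return true;
    // 5xx and downstream 429s are transient; other 4xx are caller errors.
    return err.status >= 500 || err.status === 429;
  }
  return false; // unknown errors: fail fast rather than mask bugs
}
```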

// Token bucket rate limiter
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,
    private refillRate: number, // tokens per second
  ) {
    this.tokens = maxTokens;
    this.lastRefill = Date.now();
  }

  tryConsume(): boolean {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }
}

function withRateLimit(
  toolName: string,
  handler: ToolHandler,
  rps: number = 10,
): ToolHandler {
  const bucket = new TokenBucket(rps, rps);

  return async (args) => {
    if (!bucket.tryConsume()) {
      return {
        content: [{
          type: "text",
          text: `Rate limit exceeded for ${toolName}. ` +
            `Max ${rps} calls/second. Retry after 1 second. ` +
            `Consider batching multiple IDs into a single call if available.`,
        }],
        isError: true,
      };
    }
    return handler(args);
  };
}

// Retry with exponential backoff + jitter
function withRetry(
  handler: ToolHandler,
  maxRetries: number = 3,
): ToolHandler {
  return async (args) => {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return await handler(args);
      } catch (err) {
        if (attempt === maxRetries || !isRetryable(err)) throw err;
        const delayMs = Math.min(1000 * 2 ** attempt, 10_000);
        const jitter = Math.random() * delayMs * 0.3;
        await new Promise(r => setTimeout(r, delayMs + jitter));
      }
    }
    throw new Error("unreachable");
  };
}

1. Choose a rate-limit algorithm. Token bucket for bursty workloads (it allows short bursts above the average rate); sliding window for strict per-second enforcement. Token bucket is simpler and usually the right default.
2. Classify errors for retry. Define an isRetryable function that returns true for network errors and 5xx responses, and false for 4xx responses, validation failures, and local rate-limit rejections. Never retry non-idempotent operations.
3. Tune backoff parameters. Start with a base delay of 1s, a max delay of 10s, max retries of 3, and 30% jitter. Monitor actual retry rates: if most requests need all 3 retries, your rate limit is too aggressive.
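The step-3 parameters can be sanity-checked with a quick schedule calculation. This is a sketch: `backoffDelayMs` is an illustrative helper that mirrors the delay computation inside withRetry above.

```typescript
// Delay per attempt with base 1s, cap 10s, and up to 30% jitter,
// mirroring the computation in withRetry above (jitter applied after the cap).
function backoffDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 10_000,
  jitterFrac = 0.3,
): number {
  const delay = Math.min(baseMs * 2 ** attempt, capMs);
  return delay + Math.random() * delay * jitterFrac;
}

// Attempts 0..3 land in roughly: 1-1.3s, 2-2.6s, 4-5.2s, 8-10.4s.
```

Running this for a few attempts makes the tuning trade-off concrete: with these defaults, a request that exhausts all 3 retries has already spent roughly 7 to 9 seconds waiting, which is why persistent 3-retry requests indicate an overly aggressive limit rather than a backoff problem.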