MP-301a · Module 2

Caching & Memoization Middleware

3 min read

LLMs frequently call the same tool with the same arguments within a single conversation — especially lookup tools like "get customer by ID" or "search knowledge base." Without caching, each call hits your database or external API, adding latency and cost. A caching middleware stores results keyed by tool name + arguments hash, returning cached results for duplicate calls within a TTL window. This is pure overhead reduction with no behavioral change, as long as you choose the right TTL.

The TTL decision is a correctness trade-off. A 5-minute TTL on a customer lookup means the LLM might see stale data if the customer was updated between calls — acceptable for conversational latency but dangerous for transactional workflows. A 0-second TTL (deduplicate only within the same request) eliminates staleness risk while still preventing the most common waste: the LLM calling the same tool twice in rapid succession because it forgot it already has the result. For write operations, never cache — the middleware should pass them through unconditionally.

// ToolResult and ToolHandler are the server's existing tool types,
// e.g. ToolHandler = (args: Record<string, unknown>) => Promise<ToolResult>.
interface CacheEntry {
  result: ToolResult;
  expiresAt: number;
}

function withCache(
  toolName: string,
  handler: ToolHandler,
  ttlMs: number = 60_000,
): ToolHandler {
  const cache = new Map<string, CacheEntry>();

  return async (args) => {
    // Note: JSON.stringify is order-sensitive, so the same args in a
    // different property order produce a different key (a cache miss).
    const key = JSON.stringify([toolName, args]);
    const now = Date.now();

    // Check cache
    const cached = cache.get(key);
    if (cached && cached.expiresAt > now) {
      // Log to stderr: on stdio transports, stdout carries the protocol.
      console.error(JSON.stringify({
        event: "cache_hit", tool: toolName, ttlRemaining: cached.expiresAt - now,
      }));
      return cached.result;
    }

    // Execute and cache
    const result = await handler(args);

    // Only cache successful results
    if (!result.isError) {
      cache.set(key, { result, expiresAt: now + ttlMs });
    }

    // Opportunistic sweep: once the cache passes 1,000 entries, drop
    // expired ones. Live entries are never evicted, so pair this with
    // short TTLs (or an LRU) if the key space is large.
    if (cache.size > 1000) {
      for (const [k, v] of cache) {
        if (v.expiresAt <= now) cache.delete(k);
      }
    }

    return result;
  };
}
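
Wiring the wrapper into a tool might look like the following. The tool name, `getCustomer` handler, and call counter are illustrative; a trimmed `withCache` (logging and eviction omitted) is inlined so the sketch runs on its own.

```typescript
type ToolResult = { content: string; isError?: boolean };
type ToolHandler = (args: Record<string, unknown>) => Promise<ToolResult>;

// Trimmed version of withCache (no logging, no eviction) to keep
// this usage sketch self-contained.
function withCache(toolName: string, handler: ToolHandler, ttlMs = 60_000): ToolHandler {
  const cache = new Map<string, { result: ToolResult; expiresAt: number }>();
  return async (args) => {
    const key = JSON.stringify([toolName, args]);
    const hit = cache.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.result;
    const result = await handler(args);
    if (!result.isError) cache.set(key, { result, expiresAt: Date.now() + ttlMs });
    return result;
  };
}

// Hypothetical lookup tool: counts how often the "database" is hit.
let dbCalls = 0;
async function getCustomer(args: Record<string, unknown>): Promise<ToolResult> {
  dbCalls++;
  return { content: `customer ${args.id}` };
}

const cachedGetCustomer = withCache("get_customer", getCustomer, 60_000);

(async () => {
  await cachedGetCustomer({ id: "42" });
  await cachedGetCustomer({ id: "42" }); // served from cache
  await cachedGetCustomer({ id: "7" });  // different args: real call
  console.log(dbCalls); // 2
})();
```

The handler itself never learns it is cached, which is the point: the middleware composes around any read-only tool without touching its implementation.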

Do This

  • Cache read-only tools with a TTL appropriate for data freshness requirements
  • Key the cache on tool name + full arguments to avoid collisions
  • Only cache successful responses — errors must be retryable
  • Add a size cap and eviction strategy to prevent unbounded memory growth
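
Keying on the full arguments has one subtlety: JSON.stringify is sensitive to property insertion order, so `{a: 1, b: 2}` and `{b: 2, a: 1}` would miss each other's cache entries. A sketch of an order-insensitive key (the helper name `stableKey` is illustrative):

```typescript
// Serialize args with recursively sorted object keys so equivalent
// argument objects always produce the same cache key.
function stableKey(toolName: string, args: unknown): string {
  const normalize = (v: unknown): unknown => {
    if (Array.isArray(v)) return v.map(normalize);
    if (v !== null && typeof v === "object") {
      return Object.fromEntries(
        Object.entries(v as Record<string, unknown>)
          .sort(([a], [b]) => a.localeCompare(b))
          .map(([k, val]) => [k, normalize(val)]),
      );
    }
    return v; // primitives serialize the same way regardless of order
  };
  return JSON.stringify([toolName, normalize(args)]);
}
```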

Avoid This

  • Cache write operations — each call must execute to have its side effect
  • Use a global TTL for all tools — a weather lookup and a stock price have different staleness tolerances
  • Cache error results — this turns transient failures into persistent ones
  • Skip eviction — an MCP server with unbounded cache will OOM under sustained use
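
One way to avoid a global TTL is a per-tool policy table consulted at registration time. The tool names and TTL values below are illustrative, not a recommendation for any particular dataset:

```typescript
// null means "never cache" (writes, or reads with zero staleness budget).
const TTL_POLICY: Record<string, number | null> = {
  get_customer: 60_000,    // 1 min: conversational staleness is acceptable
  search_kb: 300_000,      // 5 min: knowledge base changes slowly
  get_stock_price: 1_000,  // 1 s: near-real-time data
  update_customer: null,   // write: must execute every time
};

function ttlFor(toolName: string): number | null {
  // Unknown tools default to no caching, the safe choice.
  return TTL_POLICY[toolName] ?? null;
}
```

Defaulting unknown tools to "no caching" keeps a newly added write tool from being silently cached because nobody updated the policy table.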