MP-301a · Module 2
Caching & Memoization Middleware
3 min read
LLMs frequently call the same tool with the same arguments within a single conversation — especially lookup tools like "get customer by ID" or "search knowledge base." Without caching, each call hits your database or external API, adding latency and cost. A caching middleware stores results keyed by tool name + arguments hash, returning cached results for duplicate calls within a TTL window. This is pure overhead reduction with no behavioral change, as long as you choose the right TTL.
The TTL decision is a correctness trade-off. A 5-minute TTL on a customer lookup means the LLM might see stale data if the customer was updated between calls — acceptable for conversational latency but dangerous for transactional workflows. A 0-second TTL (deduplicate only within the same request) eliminates staleness risk while still preventing the most common waste: the LLM calling the same tool twice in rapid succession because it forgot it already has the result. For write operations, never cache — the middleware should pass them through unconditionally.
```typescript
interface CacheEntry {
  result: ToolResult;
  expiresAt: number;
}

const MAX_ENTRIES = 1000;

function withCache(
  toolName: string,
  handler: ToolHandler,
  ttlMs: number = 60_000,
): ToolHandler {
  const cache = new Map<string, CacheEntry>();
  return async (args) => {
    // Key on tool name + full arguments. Note: JSON.stringify is
    // order-sensitive, so args must be serialized with consistent key order.
    const key = JSON.stringify([toolName, args]);
    const now = Date.now();

    // Check cache
    const cached = cache.get(key);
    if (cached && cached.expiresAt > now) {
      // Log to stderr so stdout stays free for protocol traffic
      console.error(JSON.stringify({
        event: "cache_hit", tool: toolName, ttlRemaining: cached.expiresAt - now,
      }));
      return cached.result;
    }

    // Execute, then cache only successful results so errors stay retryable
    const result = await handler(args);
    if (!result.isError) {
      cache.set(key, { result, expiresAt: now + ttlMs });
    }

    // Enforce the size cap: sweep expired entries first, then drop the
    // oldest (Map iterates in insertion order) until back under the cap.
    if (cache.size > MAX_ENTRIES) {
      for (const [k, v] of cache) {
        if (v.expiresAt <= now) cache.delete(k);
      }
      for (const k of cache.keys()) {
        if (cache.size <= MAX_ENTRIES) break;
        cache.delete(k);
      }
    }
    return result;
  };
}
```
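The zero-TTL, per-request deduplication described above can be sketched separately. This is a minimal illustration, not part of the module's API: the `ToolResult`/`ToolHandler` shapes are simplified stand-ins, and `withRequestDedupe` is a hypothetical helper you would construct at the start of each request and discard afterward.

```typescript
// Simplified stand-ins for the module's types
type ToolResult = { content: string; isError?: boolean };
type ToolHandler = (args: Record<string, unknown>) => Promise<ToolResult>;

// Dedupe map scoped to a single request: identical calls within one
// LLM turn execute once, with zero cross-request staleness risk.
function withRequestDedupe(toolName: string, handler: ToolHandler): ToolHandler {
  const inFlight = new Map<string, Promise<ToolResult>>();
  return (args) => {
    const key = JSON.stringify([toolName, args]);
    const existing = inFlight.get(key);
    if (existing) return existing; // duplicate call: reuse the same promise
    const promise = handler(args).then((result) => {
      if (result.isError) inFlight.delete(key); // keep errors retryable
      return result;
    });
    inFlight.set(key, promise);
    return promise;
  };
}
```

Storing the promise (rather than the resolved result) also coalesces concurrent duplicate calls: if the LLM fires the same lookup twice in parallel, the second call awaits the first's in-flight execution instead of starting another.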
Do This
- Cache read-only tools with a TTL appropriate for data freshness requirements
- Key the cache on tool name + full arguments to avoid collisions
- Only cache successful responses — errors must be retryable
- Add a size cap and eviction strategy to prevent unbounded memory growth
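One way to honor per-tool freshness requirements is a small TTL table consulted at registration time. The tool names and values below are hypothetical, chosen only to illustrate the spread of staleness tolerances:

```typescript
// Hypothetical per-tool TTLs: pass a tool-specific ttlMs to withCache
// instead of relying on a single global default.
const TTLS_MS: Record<string, number> = {
  get_customer: 30_000,   // customer records change occasionally
  search_kb: 5 * 60_000,  // knowledge-base articles are near-static
  get_stock_price: 0,     // effectively per-request dedupe only
};

// e.g. withCache("search_kb", searchKbHandler, TTLS_MS["search_kb"])
```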
Avoid This
- Cache write operations — each call must execute to have its side effect
- Use a global TTL for all tools — a weather lookup and a stock price have different staleness tolerances
- Cache error results — this turns transient failures into persistent ones
- Skip eviction — an MCP server with unbounded cache will OOM under sustained use
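The write-passthrough rule is easiest to enforce structurally, by wrapping only tools flagged as read-only when handlers are wired up. A sketch under assumed types (the `ToolSpec` registry shape and `buildHandlers` are illustrative, not part of the module's API):

```typescript
// Simplified stand-ins for the module's types
type ToolResult = { content: string; isError?: boolean };
type ToolHandler = (args: Record<string, unknown>) => Promise<ToolResult>;
type ToolSpec = { handler: ToolHandler; readOnly: boolean; ttlMs?: number };

// Only read-only tools get the caching middleware; writes always pass
// through unconditionally so every call produces its side effect.
function buildHandlers(
  specs: Record<string, ToolSpec>,
  withCache: (name: string, h: ToolHandler, ttlMs?: number) => ToolHandler,
): Record<string, ToolHandler> {
  const out: Record<string, ToolHandler> = {};
  for (const [name, spec] of Object.entries(specs)) {
    out[name] = spec.readOnly
      ? withCache(name, spec.handler, spec.ttlMs) // reads: cacheable
      : spec.handler;                             // writes: never cached
  }
  return out;
}
```

Making cacheability a property of the tool's registration, rather than a decision inside each handler, means a new write tool can never be cached by accident.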