KM-301g · Module 2
Context Window Management
3 min read
The context window is the finite resource that determines how much retrieved knowledge can inform a generation. Every token spent on irrelevant content is a token unavailable for relevant content. Context window management is the discipline of maximizing the signal-to-noise ratio in the context that reaches the generation model — ensuring that the most relevant content occupies the available space, that irrelevant content is filtered out, and that the context structure guides the generation model toward the most important information.
- **Context Budget Allocation.** Define the context budget for each component: system prompt, retrieved chunks, conversation history, and generation reserve. A typical allocation for a 32K-token context window: 1K for the system prompt, 20K for retrieved chunks, 8K for conversation history, and 3K as generation reserve. The allocation is a design decision — not a default. Different use cases have different optimal allocations.
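The budget idea above can be sketched as a simple admission policy: keep ranked chunks until the retrieval budget is spent. The budget figures mirror the 32K allocation in the text; the token estimate (whitespace word count) is a stand-in for the model's real tokenizer, and all names here are illustrative.

```python
# Token budgets for a hypothetical 32K-token context window,
# mirroring the allocation described above.
BUDGET = {
    "system_prompt": 1_000,
    "retrieved_chunks": 20_000,
    "conversation_history": 8_000,
    "generation_reserve": 3_000,
}


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def fit_chunks(chunks: list[str], budget: int = BUDGET["retrieved_chunks"]) -> list[str]:
    """Admit chunks (already ranked by relevance) until the budget is exhausted."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # this chunk would overflow the retrieval budget
        kept.append(chunk)
        used += cost
    return kept


chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(fit_chunks(chunks, budget=5))  # → ['alpha beta gamma', 'delta epsilon']
```

A production system would also enforce the history and reserve budgets the same way, trimming the oldest conversation turns first.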
- **Lost in the Middle Problem.** Research consistently shows that generation models pay disproportionate attention to content at the beginning and end of the context window, with degraded attention to content in the middle. For RAG systems, this means the most relevant chunks should be placed at the edges of the retrieved context, not ordered by similarity score from top to bottom. The highest-scoring chunk goes first. The second-highest chunk goes last. The rest fills in between.
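The edge-first ordering described above can be implemented by alternately assigning ranked chunks to the front and back of the context, so the weakest chunks land in the middle where attention is weakest. This is one possible sketch; the function name is illustrative.

```python
def reorder_for_attention(ranked_chunks: list[str]) -> list[str]:
    """Place the highest-ranked chunks at the edges of the context.

    Input is ordered best-first. Rank 1 goes first, rank 2 last,
    rank 3 second, rank 4 second-to-last, and so on, pushing the
    lowest-ranked chunks into the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        if i % 2 == 0:
            front.append(chunk)   # odd ranks (1st, 3rd, ...) fill from the start
        else:
            back.append(chunk)    # even ranks (2nd, 4th, ...) fill from the end
    return front + back[::-1]


print(reorder_for_attention(["rank1", "rank2", "rank3", "rank4", "rank5"]))
# → ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```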
- **Dynamic Context Sizing.** Not every query requires the same amount of context. Simple factual queries need one or two chunks. Complex analytical queries need ten. Dynamic context sizing adjusts the number of retrieved chunks based on query complexity — a classifier or the LLM itself can estimate whether the query requires broad or narrow retrieval. Static top-K retrieval wastes context on simple queries and starves complex ones.
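A minimal sketch of dynamic sizing, assuming a keyword heuristic stands in for the classifier the text mentions. The cue list, thresholds, and chunk counts are illustrative assumptions, not prescribed values; a production system would use a trained classifier or an LLM call.

```python
def choose_top_k(query: str) -> int:
    """Estimate how many chunks to retrieve from query complexity.

    Heuristic stand-in for a complexity classifier: analytical cues
    trigger broad retrieval, short queries get narrow retrieval.
    """
    analytical_cues = ("compare", "analyze", "why", "explain", "trade-off")
    lowered = query.lower()
    if any(cue in lowered for cue in analytical_cues):
        return 10  # complex analytical query: broad retrieval
    if len(lowered.split()) <= 8:
        return 2   # short factual query: narrow retrieval
    return 5       # middle ground for everything else


print(choose_top_k("What year was the company founded?"))   # → 2
print(choose_top_k("Compare approach A with approach B"))   # → 10
```

The chosen `k` then feeds directly into the retriever, replacing a static top-K setting.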