KM-301g · Module 3
The Retrieval Improvement Loop
3 min read
Retrieval quality improvement is not a one-time optimization — it is a continuous engineering loop. The system is deployed, real queries accumulate, failure cases are identified, root causes are diagnosed, components are updated, and the evaluation dataset is expanded with the cases that revealed the failures. This loop is what separates a knowledge system that degrades over time from one that improves. The loop requires infrastructure: query logging, failure classification, and a deployment pipeline that allows component updates without service interruption.
- Query Logging and Failure Detection: Log every query, the retrieved chunks, the generated answer, and any explicit user feedback. For systems without explicit feedback, use implicit signals: did the user ask a follow-up question that suggests the first answer was insufficient? Did the user refine the query? Log these signals as soft negative feedback. The query log is the raw material for the improvement loop.
- Failure Classification: Classify failures into their root cause categories: retrieval recall failure (the correct chunk was not retrieved), retrieval precision failure (irrelevant chunks dominated the context), context window failure (the relevant chunk was retrieved but was too low in the context window), generation faithfulness failure (the model hallucinated despite good retrieval), or knowledge gap failure (the answer does not exist in the knowledge base). Each category has a different remediation path.
- Component-Level Remediation: For each failure category, apply the specific fix: recall failures → improve chunking or embedding model. Precision failures → improve re-ranking or add metadata filtering. Context window failures → apply lost-in-the-middle positioning fix. Generation faithfulness failures → add faithfulness constraints to the prompt. Knowledge gap failures → identify and add missing content to the knowledge base.
- Evaluation Dataset Expansion: Add every classified failure case to the evaluation dataset. The evaluation dataset grows to reflect the actual failure modes encountered in production. After each improvement cycle, run the full evaluation dataset to confirm the fix resolved the targeted failures without degrading performance on previously passing cases. The evaluation dataset is the regression test suite for the retrieval system.
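The logging step above can be sketched in a few lines of Python. This is a minimal in-memory sketch, not a production logger: the entry fields, the `log_query` helper, and the 120-second refinement heuristic are illustrative assumptions, not a prescribed schema.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QueryLogEntry:
    query: str
    retrieved_chunk_ids: List[str]
    answer: str
    timestamp: float = field(default_factory=time.time)
    explicit_feedback: Optional[str] = None  # e.g. "thumbs_up" / "thumbs_down"
    implicit_negative: bool = False          # set when a follow-up suggests failure

def log_query(log: List[QueryLogEntry], entry: QueryLogEntry) -> None:
    """Append an entry; flag the previous query on an apparent refinement."""
    if log:
        prev = log[-1]
        prev_terms = set(prev.query.lower().split())
        shared = prev_terms & set(entry.query.lower().split())
        # Illustrative heuristic: a quick follow-up reusing most of the previous
        # query's terms is treated as a refinement, i.e. soft negative feedback.
        if entry.timestamp - prev.timestamp < 120 and len(shared) >= len(prev_terms) // 2:
            prev.implicit_negative = True
    log.append(entry)

log: List[QueryLogEntry] = []
log_query(log, QueryLogEntry("how to rotate api keys", ["c12", "c87"],
                             "Use the admin console.", timestamp=100.0))
log_query(log, QueryLogEntry("how to rotate api keys via the cli", ["c12", "c30"],
                             "Run the rotate command.", timestamp=130.0))
print(log[0].implicit_negative)  # True: the refinement marks the first query
```

In practice these entries would be flushed to durable storage; the point is that retrieved chunk IDs and feedback signals are captured alongside the query, so later triage does not have to reconstruct what the retriever returned.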
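The failure taxonomy above lends itself to a triage helper. The sketch below assumes each failure case has already been annotated with the gold chunk ID (or `None` for knowledge gaps) and a grounded/hallucinated judgment; real triage still needs human review, and the rank threshold of 3 is an arbitrary illustration.

```python
from enum import Enum

class FailureCategory(Enum):
    RECALL = "retrieval_recall"
    PRECISION = "retrieval_precision"
    CONTEXT_WINDOW = "context_window"
    FAITHFULNESS = "generation_faithfulness"
    KNOWLEDGE_GAP = "knowledge_gap"

def classify_failure(gold_chunk_id, retrieved_ids, answer_grounded, rank_threshold=3):
    """Triage a case already known to be a failure, given the annotations above."""
    if gold_chunk_id is None:                  # answer is not in the knowledge base
        return FailureCategory.KNOWLEDGE_GAP
    if gold_chunk_id not in retrieved_ids:     # correct chunk never retrieved
        return FailureCategory.RECALL
    if retrieved_ids.index(gold_chunk_id) >= rank_threshold:
        return FailureCategory.CONTEXT_WINDOW  # retrieved, but buried low in context
    if not answer_grounded:
        return FailureCategory.FAITHFULNESS    # good retrieval, hallucinated answer
    return FailureCategory.PRECISION           # likely drowned out by irrelevant chunks

print(classify_failure("c7", ["c1", "c2"], answer_grounded=True).value)
```

The branch order matters: a knowledge gap must be ruled out before blaming retrieval, and recall before position, so that each case lands in exactly one remediation path.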
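The post-cycle regression check above reduces to a diff of per-case pass/fail results between two evaluation runs. A minimal sketch, assuming each run is summarized as a dict of case ID → pass:

```python
def regression_check(before, after):
    """Diff two evaluation runs, each a dict of case id -> pass (bool)."""
    return {
        # cases now passing that previously failed (or are new to the dataset)
        "fixed": sorted(c for c, ok in after.items() if ok and not before.get(c, False)),
        # previously passing cases that now fail: the loop introduced a regression
        "regressed": sorted(c for c, ok in after.items() if not ok and before.get(c, False)),
        # cases added to the dataset since the last run
        "new": sorted(set(after) - set(before)),
    }

# After a fix cycle: q2 now passes, q3 was just added from a production failure.
report = regression_check({"q1": True, "q2": False},
                          {"q1": True, "q2": True, "q3": False})
print(report)  # -> {'fixed': ['q2'], 'regressed': [], 'new': ['q3']}
```

A non-empty `regressed` list is the signal to block the deployment: the targeted fix is only an improvement if previously passing cases still pass.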
Do This
- Log every query and every retrieval result from day one — you cannot analyze what you did not capture
- Classify failures by root cause before attempting remediation — the fix must match the failure type
- Expand the evaluation dataset with every new failure type — the dataset should grow as the system encounters more of the real-world query distribution
- Run the full evaluation dataset after every change — the improvement loop must not introduce regressions
Avoid This
- Making retrieval changes based on qualitative feedback without measurement — you will optimize for one visible failure while introducing invisible ones
- Assuming a fix that improves one failure category does not degrade another — test the full evaluation dataset every time
- Stopping the improvement loop after initial deployment — retrieval quality degrades as the knowledge base grows and query patterns change