KM-301g · Module 1
How Retrieval Quality Is Determined Before a Query Runs
Every retrieval quality problem I have diagnosed in enterprise knowledge systems was a pre-query problem. Not a bad query, not a bad prompt, not a bad generation model. The knowledge base was indexed in a way that made retrieval of the relevant content structurally impossible. By the time the user submits a query, the retrieval quality ceiling has already been set. Understanding what sets that ceiling — and how to raise it — is the practical application of everything in this module.
- Metadata Quality: Every chunk should carry rich metadata: source document title, date, author, section, document type, and relevant categorical tags. Metadata enables filtered retrieval, such as "find the most relevant chunk from documents published after 2024", which pure vector similarity cannot provide. A knowledge base without metadata is a knowledge base that can only search content, not context.
- Index Freshness: The knowledge base is only as current as its last index update. Documents added to the source system but not re-indexed are invisible to retrieval. Define the maximum acceptable staleness for your knowledge base and build an automated re-indexing pipeline that maintains it. For operational knowledge bases, staleness of more than 24 hours is typically unacceptable.
- Corpus Quality: Garbage in, garbage out. A knowledge base indexed from poorly written, inconsistently formatted, or factually outdated source documents will have systematically poor retrieval quality regardless of the architecture. Corpus quality is a prerequisite for retrieval quality. Run a corpus audit before indexing: identify and remediate low-quality documents rather than indexing and retrieving them.
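
The metadata and freshness points above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the Chunk class, the filtered_search and stale_documents helpers, the two-dimensional toy embeddings, and the 24-hour default are all assumptions made for the example. A production system would typically delegate filtering and ranking to its vector store and read modification times from the source system.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta, timezone
import math


@dataclass
class Chunk:
    """Hypothetical chunk record: content plus the metadata stored at index time."""
    text: str
    embedding: list[float]  # toy vector; a real system would use an embedding model
    title: str
    published: date
    doc_type: str


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(chunks, query_vec, published_after, top_k=1):
    """Apply the metadata filter first, then rank the survivors by similarity."""
    candidates = [c for c in chunks if c.published > published_after]
    ranked = sorted(candidates, key=lambda c: cosine(c.embedding, query_vec), reverse=True)
    return ranked[:top_k]


def stale_documents(source_modified, indexed_at, now, max_staleness=timedelta(hours=24)):
    """Return sorted IDs of documents whose latest source change is not yet
    indexed and has been waiting longer than max_staleness."""
    stale = []
    for doc_id, modified in source_modified.items():
        indexed = indexed_at.get(doc_id)
        change_not_indexed = indexed is None or indexed < modified
        if change_not_indexed and now - modified > max_staleness:
            stale.append(doc_id)
    return sorted(stale)


if __name__ == "__main__":
    # Illustrative data only; titles, dates, and vectors are made up.
    chunks = [
        Chunk("Old VPN policy", [1.0, 0.0], "VPN Policy v1", date(2023, 5, 1), "policy"),
        Chunk("New VPN policy", [0.9, 0.1], "VPN Policy v2", date(2025, 2, 1), "policy"),
        Chunk("Cafeteria menu", [0.0, 1.0], "Menu", date(2025, 3, 1), "notice"),
    ]
    hits = filtered_search(chunks, [1.0, 0.0], published_after=date(2024, 12, 31))
    print([c.title for c in hits])

    now = datetime(2025, 3, 10, 12, 0, tzinfo=timezone.utc)
    source_modified = {"a": now - timedelta(hours=30), "b": now - timedelta(hours=2)}
    indexed_at = {"a": now - timedelta(hours=40)}
    print(stale_documents(source_modified, indexed_at, now))
```

In this sketch the outdated policy chunk is the closest vector match to the query, so only the date filter keeps it out of the results. That is the sense in which the quality ceiling is set at index time: if the published field had never been stored, no query could exclude it.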