KM-301g · Module 1

How Retrieval Quality Is Determined Before a Query Runs

3 min read

Every retrieval quality problem I have diagnosed in enterprise knowledge systems was a pre-query problem. Not a bad query, not a bad prompt, not a bad generation model. The knowledge base was indexed in a way that made retrieval of the relevant content structurally impossible. By the time the user submits a query, the retrieval quality ceiling has already been set. Understanding what sets that ceiling — and how to raise it — is the practical application of everything in this module.

Metadata Quality. Every chunk should carry rich metadata: source document title, date, author, section, document type, and relevant categorical tags. Metadata enables filtered retrieval, such as "find the most relevant chunk from documents published after 2024", which pure vector similarity cannot provide. A knowledge base without metadata can search only content, not context.
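
To make filtered retrieval concrete, here is a minimal sketch in plain Python. The `Chunk` fields, the two-dimensional toy embeddings, and the `filtered_search` helper are all hypothetical illustrations, not the API of any particular vector store. The sketch assumes the metadata filter is applied first and similarity ranking runs only over the chunks that survive it.

```python
from dataclasses import dataclass
from datetime import date
import math


@dataclass
class Chunk:
    # Hypothetical chunk record: content plus the metadata carried with it.
    text: str
    embedding: list[float]
    title: str
    published: date
    doc_type: str


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(chunks, query_embedding, published_after, top_k=3):
    """Filter on metadata first, then rank the survivors by similarity."""
    candidates = [c for c in chunks if c.published > published_after]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: the outdated guide is the closest vector match to the query.
chunks = [
    Chunk("Old VPN setup steps", [0.9, 0.1], "VPN Guide v1", date(2022, 3, 1), "how-to"),
    Chunk("Current VPN setup steps", [0.8, 0.2], "VPN Guide v2", date(2025, 1, 15), "how-to"),
    Chunk("Expense policy", [0.1, 0.9], "Finance Handbook", date(2025, 2, 1), "policy"),
]
query = [1.0, 0.0]

results = filtered_search(chunks, query, published_after=date(2024, 12, 31))
titles = [c.title for c in results]
```

In this toy data, similarity alone would rank the 2022 guide first. The date filter is what removes it, and that filter is only possible because each chunk carries a `published` field.
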
Index Freshness. The knowledge base is only as current as its last index update. Documents added to the source system but not re-indexed are invisible to retrieval. Define the maximum acceptable staleness for your knowledge base and build an automated re-indexing pipeline that maintains it. For operational knowledge bases, staleness of more than 24 hours is typically unacceptable.
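
A staleness budget can be checked mechanically. The sketch below assumes you can obtain two mappings: a last-modified timestamp per document from the source system, and a last-indexed timestamp per document from the index. The names `find_stale`, `source_modified`, and `indexed_at` are hypothetical, and the 24-hour default simply mirrors the figure above.

```python
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(hours=24)  # assumed budget; tune per knowledge base


def find_stale(source_modified, indexed_at, now, max_staleness=MAX_STALENESS):
    """Return ids of documents whose latest version has gone unindexed
    for longer than the staleness budget."""
    stale = []
    for doc_id, modified in source_modified.items():
        last_indexed = indexed_at.get(doc_id)
        # Unindexed means never indexed, or indexed before the latest change.
        unindexed = last_indexed is None or last_indexed < modified
        if unindexed and now - modified > max_staleness:
            stale.append(doc_id)
    return sorted(stale)


now = datetime(2025, 6, 10, 12, 0, tzinfo=timezone.utc)
source_modified = {
    "runbook": now - timedelta(hours=30),  # changed 30h ago, never indexed
    "faq": now - timedelta(hours=2),       # changed 2h ago, still within budget
    "policy": now - timedelta(hours=72),   # changed 72h ago, indexed since
}
indexed_at = {"policy": now - timedelta(hours=48)}

stale_ids = find_stale(source_modified, indexed_at, now)
```

A scheduled job could run a check like this and either trigger re-indexing for the returned ids or raise an alert when the list is not empty.
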
Corpus Quality. Garbage in, garbage out. A knowledge base indexed from poorly written, inconsistently formatted, or factually outdated source documents will have systematically poor retrieval quality regardless of the architecture. Corpus quality is a prerequisite for retrieval quality. Run a corpus audit before indexing: identify low-quality documents and remediate them rather than indexing them and serving them to users.
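
A corpus audit can start with simple automated checks before any human review. The sketch below is one possible starting point under stated assumptions: the document fields (`id`, `title`, `text`, `last_reviewed`) and the thresholds (`min_words`, `max_age_days`) are hypothetical and should be set for your own corpus. Heuristics like these flag candidates for remediation. They cannot judge factual accuracy.

```python
from datetime import date


def audit_document(doc, today, min_words=50, max_age_days=730):
    """Return a list of quality issues found in one document (empty if none)."""
    issues = []
    if not doc.get("title", "").strip():
        issues.append("missing title")
    if len(doc.get("text", "").split()) < min_words:
        issues.append("too short")
    if (today - doc["last_reviewed"]).days > max_age_days:
        issues.append("possibly outdated")
    return issues


today = date(2025, 6, 1)
docs = [
    {
        "id": "onboarding",
        "title": "Onboarding Guide",
        "text": "word " * 120,
        "last_reviewed": date(2025, 1, 10),
    },
    {
        "id": "draft-7",
        "title": "",
        "text": "TBD",
        "last_reviewed": date(2021, 5, 1),
    },
]

# Audit every document, then keep only the ones that need attention.
report = {doc["id"]: audit_document(doc, today) for doc in docs}
needs_remediation = {doc_id: issues for doc_id, issues in report.items() if issues}
```

Only documents with an empty issue list would proceed to indexing. The rest go to an owner for rewriting, updating, or retirement.
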