KM-201c · Module 3
Retrieval Quality Metrics: Precision, Recall, and Relevance
4 min read
You cannot improve what you do not measure. A knowledge retrieval system that is not producing the right answers for users is failing, but without measurement infrastructure the failure is invisible until users stop using the system entirely. By then, trust has been lost, and rebuilding it requires both fixing the system and demonstrating that it has been fixed. Measurement infrastructure catches failures while they are still correctable and provides the data to make targeted improvements.
Retrieval quality has three distinct dimensions. Precision measures whether the retrieved chunks are relevant to the query: of the 5 chunks retrieved, how many were actually relevant? A high-precision retrieval returns mostly relevant results and few irrelevant ones. Recall measures whether the retrieval system found all the relevant knowledge that exists in the knowledge base: of all the documents relevant to the query, what percentage were retrieved? A high-recall retrieval misses few relevant documents. Relevance combines both: it is the overall user-facing measure of whether the answers the system provides are accurate, complete, and useful.
- Precision Measurement: Build a test set of 50–100 representative queries with known relevant documents for each query. Run the retrieval system against the test set. For each query, measure what percentage of the retrieved chunks were in the known-relevant set. Track precision@k (precision over the top k retrieved results) for k = 1, 3, 5; the most important are precision@1 (is the top result relevant?) and precision@3 (what fraction of the top 3 results are relevant?). Target: precision@3 above 0.75 for a production retrieval system.
- Recall Measurement: Using the same test set, ask for each query: what percentage of the known-relevant documents were returned in the top-k results? Recall is harder to measure than precision because it requires knowing all the relevant documents for a query, not just whether a returned result is relevant. Sample-based recall measurement is practical: select 30 queries, manually identify all relevant documents in the knowledge base for each, and measure the percentage retrieved. Target: recall@10 above 0.80 for a production system.
- End-to-End Answer Quality: Precision and recall measure the retrieval step. The synthesis step introduces additional quality dimensions: accuracy (is the synthesized answer factually correct based on the source material?), completeness (does the answer address all aspects of the question?), and citation accuracy (do the cited sources actually support the claims made?). Measure end-to-end answer quality on the test set by having domain experts rate answers on a 1–5 scale across these dimensions. This measurement catches failures in the synthesis step that precision and recall metrics miss.
- Operational Metrics: Beyond accuracy, track retrieval latency (how long does a query take?), null response rate (what percentage of queries return no result?), and low-confidence rate (what percentage of queries return a low-confidence result?). The null response rate and low-confidence rate are the most actionable operational metrics: they identify the knowledge gaps causing retrieval failures and should drive the knowledge capture backlog.
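The precision@k and recall@k measurements described above can be sketched in a few lines. The doc IDs, rankings, and relevance judgments here are hypothetical stand-ins for a real test set:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are in the known-relevant set."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(doc in relevant for doc in top_k) / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Fraction of the known-relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Hypothetical test set: (retrieved doc IDs in rank order, known-relevant set) per query.
test_set = [
    (["d12", "d7", "d3", "d40", "d9", "d2", "d31", "d5", "d18", "d4"], {"d12", "d3", "d21"}),
    (["d8", "d1", "d22", "d6", "d9", "d14", "d2", "d30", "d11", "d7"], {"d8", "d22"}),
]

# Macro-average each metric over the query set.
for k in (1, 3, 5):
    p = sum(precision_at_k(r, rel, k) for r, rel in test_set) / len(test_set)
    print(f"precision@{k} = {p:.2f}")

mean_recall_10 = sum(recall_at_k(r, rel, 10) for r, rel in test_set) / len(test_set)
print(f"recall@10 = {mean_recall_10:.2f}")
```

Macro-averaging (averaging per-query scores rather than pooling all judgments) keeps hard queries from being drowned out by easy ones; track the averages against the 0.75 precision@3 and 0.80 recall@10 targets.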
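For the expert-rated end-to-end quality, averaging each dimension separately keeps synthesis failures visible instead of hiding them inside one blended score. A minimal sketch with hypothetical expert scores:

```python
from statistics import mean

# Hypothetical 1-5 expert ratings, one dict per rated answer, across the
# three synthesis dimensions named in the module.
ratings = [
    {"accuracy": 4, "completeness": 3, "citation_accuracy": 5},
    {"accuracy": 5, "completeness": 4, "citation_accuracy": 4},
    {"accuracy": 3, "completeness": 4, "citation_accuracy": 3},
]

# Average each dimension separately; a low citation_accuracy with high accuracy,
# for example, points at the citation step rather than retrieval.
per_dimension = {dim: mean(r[dim] for r in ratings) for dim in ratings[0]}
print(per_dimension)
```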
Do This
- Build a test set of queries with known-relevant documents before deploying to production
- Measure precision, recall, and end-to-end answer quality as separate dimensions
- Track operational metrics (latency, null response rate) continuously in production
- Use the test set to validate every retrieval architecture change
Avoid This
- Evaluate retrieval quality by asking 'does it find things' without quantitative measurement
- Use test queries written solely by the system builder; build the test set from real user queries, including ones that have historically failed
- Treat the test set as a deployment checklist rather than ongoing quality infrastructure
- Optimize for precision alone at the cost of recall — incomplete answers are a form of failure