KM-301g · Module 3

Retrieval Evaluation Frameworks

4 min read

Most teams evaluate RAG systems by asking the system questions and judging whether the answers seem right. This is not evaluation — it is impressionism. It is sensitive to the questions chosen, biased toward answers that sound confident, and completely unable to identify which component of the pipeline is responsible for poor performance. Rigorous retrieval evaluation separates retrieval quality from generation quality, measures both with specific metrics, and produces a causal diagnosis of every performance failure.

  1. Retrieval Metrics. Measure retrieval independently from generation. Recall@K: what fraction of relevant documents appear in the top-K retrieved results? Precision@K: what fraction of the top-K retrieved results are relevant? MRR (Mean Reciprocal Rank): where does the first relevant document appear in the ranking? NDCG (Normalized Discounted Cumulative Gain): accounts for the graded relevance of retrieved documents at each rank position. These metrics require a labeled evaluation dataset, the prerequisite for all meaningful retrieval evaluation.
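The four metrics above can be sketched in a few lines. This is an illustrative implementation, not taken from the module: it assumes documents are compared by ID, and that graded relevance is supplied as a doc-ID-to-grade map.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in set(relevant)) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc; 0 if none was retrieved.
    Averaging this over all queries gives MRR."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, gains, k):
    """NDCG with graded relevance: `gains` maps doc ID -> relevance grade."""
    dcg = sum(gains.get(d, 0) / math.log2(rank + 1)
              for rank, d in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

For example, if the system returns `["d3", "d1", "d7"]` and the labeled relevant set is `{"d1", "d9"}`, then Recall@3 is 0.5, Precision@3 is 1/3, and the reciprocal rank is 0.5.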
  2. Generation Metrics. Measure generation quality independently from retrieval. Faithfulness: is the generated answer grounded in the retrieved context (no hallucination)? Answer relevance: does the generated answer address the question? Context precision: does the retrieved context contain information needed to answer the question? Context recall: does the retrieved context contain all information needed to answer the question fully? Frameworks like RAGAS automate these measurements against ground truth.
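Frameworks like RAGAS score these properties with LLM judges. A minimal sketch of the faithfulness idea, where a hypothetical `judge(claim, context) -> bool` stands in for the LLM call and claim decomposition is assumed to have happened already:

```python
def faithfulness(claims, context, judge):
    """Fraction of the answer's claims supported by the retrieved context.

    `judge(claim, context) -> bool` is a stand-in for an LLM judge; real
    frameworks also decompose the answer into claims automatically.
    """
    if not claims:
        return 1.0
    return sum(judge(c, context) for c in claims) / len(claims)

# Toy demo: a substring check stands in for the LLM judge.
context = "Policy: the refund window is 30 days from invoice."
claims = ["the refund window is 30 days", "refunds are automatic"]
score = faithfulness(claims, context, judge=lambda c, ctx: c in ctx)
# One of the two claims is grounded, so score == 0.5
```

The substring judge is only for illustration; in practice the judge must handle paraphrase and entailment, which is exactly why these frameworks delegate it to a model.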
  3. End-to-End Evaluation. Combine retrieval and generation metrics into an end-to-end quality score. More importantly, cross-reference the metrics to identify failure causality: if faithfulness is low but context precision is high, the generation model is hallucinating despite good retrieval: a generation problem. If context precision is low but recall is high, re-ranking is letting irrelevant chunks through: a retrieval pipeline problem. If recall is low, the embedding model or chunking strategy is the problem.
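The causal rules above can be written as a small decision function. The 0.7 threshold is an illustrative assumption, not a value prescribed by the module, and recall is checked first because a recall failure makes the downstream metrics uninformative:

```python
def diagnose(metrics, threshold=0.7):
    """Map the cross-referenced metric pattern to a likely root cause.

    `metrics` holds per-dataset averages for 'faithfulness',
    'context_precision', and 'context_recall' in [0, 1].
    """
    if metrics["context_recall"] < threshold:
        # Relevant content never made it into the candidate set.
        return "retrieval: embedding or chunking misses relevant content"
    if metrics["context_precision"] < threshold:
        # Recall is fine but the context is noisy.
        return "retrieval pipeline: re-ranking lets irrelevant chunks through"
    if metrics["faithfulness"] < threshold:
        # Good context, ungrounded answer.
        return "generation: model hallucinates despite good retrieval"
    return "no failure above threshold"

print(diagnose({"faithfulness": 0.3,
                "context_precision": 0.9,
                "context_recall": 0.9}))
```

A diagnosis like this is only as trustworthy as the metric averages behind it, which is why the labeled dataset in the next step matters.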
  4. Evaluation Dataset Construction. Build an evaluation dataset from your actual knowledge base: 50–200 question-answer pairs covering the query types your system will face. Include simple factual questions, multi-hop reasoning questions, comparative questions, and edge cases. Label each question with the specific chunks that contain the answer. This dataset is the ground truth against which every retrieval metric is computed; without it, you have no measurement capability.
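One way to represent a labeled example is a small record type. The field names and sample content here are hypothetical, chosen only to show the shape the metrics above require:

```python
from dataclasses import dataclass

@dataclass
class EvalExample:
    question: str
    reference_answer: str
    answer_chunk_ids: list[str]  # chunks labeled as containing the answer
    query_type: str              # "factual", "multi_hop", "comparative", "edge_case"

# Sample entry (hypothetical content):
dataset = [
    EvalExample(
        question="What is the refund window for annual plans?",
        reference_answer="30 days from the invoice date.",
        answer_chunk_ids=["billing-policy-004"],
        query_type="factual",
    ),
]

# Every question must be labeled with at least one answer-bearing chunk,
# or no retrieval metric can be computed against it.
assert all(ex.answer_chunk_ids for ex in dataset)
```

Each example's `answer_chunk_ids` is exactly the labeled relevant set that Recall@K, Precision@K, and MRR are computed against.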