KM-301g · Module 3

Benchmark Datasets for Retrieval Systems

3 min read

There are two types of evaluation dataset relevant to enterprise RAG systems: public benchmarks that provide a comparative baseline, and domain-specific datasets that measure performance on the actual knowledge base and query types the system will serve. Public benchmarks tell you whether your system is competitive with the state of the art. Domain datasets tell you whether your system is good enough for your specific use case. You need both, and you need to understand what each one measures.

  1. Public Benchmarks
     - BEIR (Benchmarking Information Retrieval): a heterogeneous benchmark collecting 18 retrieval datasets across diverse domains and task types. Use it to evaluate embedding model selection and establish a baseline.
     - HotpotQA: multi-hop question answering that requires retrieval across multiple documents. Use it to evaluate systems that must synthesize information from more than one source.
     - MS MARCO: a large-scale passage retrieval benchmark. Use it to evaluate re-ranking components.
     Each benchmark tests a specific retrieval capability; use the one that matches your primary use case.
  2. Domain Dataset Construction
     Build a domain evaluation dataset by sampling your actual queries (or expected query types), identifying the correct answers, and labeling the source chunks that contain them. Aim for a minimum of 50 examples for an initial evaluation and 200+ for statistically reliable comparison between configurations. Refresh the dataset whenever the knowledge base changes significantly: a stale evaluation dataset produces misleading metrics for a current system.
  3. Synthetic Dataset Generation
     For knowledge bases where real query samples are not available, use an LLM to generate synthetic evaluation questions from the source documents. Prompt: "Given this document excerpt, generate three questions whose answer is contained in this excerpt." The generated questions are not as representative as real queries, but they are far better than no evaluation dataset at all. Validate a sample of generated questions manually before using them for evaluation.
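Whichever dataset you use, evaluation comes down to comparing what the retriever returned against the labeled relevant chunks. A minimal sketch of recall@k, the most common such metric, follows; the query, chunk IDs, and retrieved list are hypothetical examples, not part of any benchmark.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the labeled relevant chunks found in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical labeled example: a query and the chunk IDs that answer it.
example = {"query": "What is the refund window?", "relevant": ["doc3#c2", "doc7#c1"]}
# Hypothetical ranked output from the retriever under test.
retrieved = ["doc3#c2", "doc1#c4", "doc7#c1", "doc9#c0"]

print(recall_at_k(retrieved, example["relevant"], k=3))  # 1.0: both labeled chunks are in the top 3
```

Averaging this score over every example in the dataset gives a single number you can track across embedding models, chunking strategies, and re-ranker configurations.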
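The synthetic-generation step in item 3 is a loop: format the prompt with each excerpt, call the model, and keep the questions paired with their source excerpt so correctness can be checked later. A sketch, where `llm_complete` stands in for whatever LLM client you use (any callable taking a prompt string and returning text, one question per line, is assumed):

```python
PROMPT_TEMPLATE = (
    "Given this document excerpt, generate three questions whose answer "
    "is contained in this excerpt.\n\nExcerpt:\n{excerpt}"
)

def generate_synthetic_questions(excerpts, llm_complete, per_excerpt=3):
    """Build (question, source excerpt) pairs from document excerpts.

    `llm_complete` is a placeholder for your LLM call; it must accept a
    prompt string and return the model's text output.
    """
    dataset = []
    for i, excerpt in enumerate(excerpts):
        raw = llm_complete(PROMPT_TEMPLATE.format(excerpt=excerpt))
        questions = [line.strip() for line in raw.splitlines() if line.strip()]
        for q in questions[:per_excerpt]:
            dataset.append({"question": q, "source_excerpt_id": i})
    return dataset
```

Keeping `source_excerpt_id` on every record is what makes the manual validation pass practical: a reviewer can read each question next to the exact excerpt it was generated from.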