KM-301g · Module 2

Naive RAG vs. Advanced RAG vs. Modular RAG

4 min read

Retrieval-Augmented Generation (RAG) is now the standard architecture for knowledge-grounded AI responses. But "RAG" covers a wide spectrum of implementation patterns with significantly different performance characteristics. Naive RAG is straightforward to implement and fails in predictable ways. Advanced RAG adds pre-retrieval and post-retrieval processing to address those failures. Modular RAG decomposes the entire pipeline into replaceable components, so each stage can be measured, optimized, and swapped independently. The choice between them is a cost-benefit decision: naive RAG is cheap and sufficient for many use cases; modular RAG is expensive and necessary for high-stakes applications.

  1. Naive RAG: Query → embed query → retrieve top-K chunks → concatenate into context → generate answer. Simple, fast, and suitable for low-stakes knowledge retrieval where approximate answers are acceptable. Failure modes: query-document semantic mismatch (the query's terminology does not match the document's terminology), context window overflow (top-K chunks exceed the generation model's context limit), and retrieval of irrelevant but superficially similar content.
  2. Advanced RAG (pre-retrieval): Add processing before retrieval to improve query quality: query rewriting (rephrase the query to match the terminology used in the knowledge base), hypothetical document embedding (generate a hypothetical answer to the query and embed it; the embedding is more similar to relevant documents than the raw query), and query expansion (add related terms and synonyms to improve recall). Pre-retrieval processing is the most impactful investment for systems where the query terminology does not match the document terminology.
  3. Advanced RAG (post-retrieval): Add processing after retrieval to improve context quality: re-ranking (use a cross-encoder to re-score retrieved chunks against the query and reorder them), compression (remove irrelevant sentences from retrieved chunks to maximize relevant content per context token), and deduplication (remove semantically identical chunks that would waste context window tokens). Post-retrieval processing is critical for systems where generated answer quality degrades as the context window fills.
  4. Modular RAG: Decompose the pipeline into independently replaceable modules: query module, retrieval module, re-ranking module, fusion module, generation module. Each module can be optimized independently or replaced without rewriting the pipeline. Modular RAG is the production architecture for high-stakes knowledge systems where specific failure modes have been identified and must be addressed without compromising other components.
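The naive pipeline in item 1 can be sketched end to end. This is a toy illustration, not a production implementation: `embed` is a bag-of-words stand-in for a real embedding model, and the final prompt would be handed to whatever generation model the system uses (not shown).

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding" standing in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=3):
    # Naive retrieval: score every chunk against the query, keep top-K.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks):
    # Concatenate retrieved chunks into the generation context.
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Note that the top-K cutoff is fixed regardless of relevance, which is exactly where the "irrelevant but superficially similar content" failure mode enters.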
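A minimal pre-retrieval step (item 2) might look like the following query expansion sketch. The `SYNONYMS` map is a hypothetical hand-built example; real systems typically use an LLM or a learned thesaurus to do the rewriting.

```python
# Hypothetical domain synonym map: query vocabulary -> knowledge-base vocabulary.
SYNONYMS = {
    "car": ["vehicle", "automobile"],
    "fix": ["repair", "troubleshoot"],
}

def expand_query(query):
    # Append synonyms for each query term so the expanded query's
    # vocabulary overlaps the documents' vocabulary, improving recall.
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    return " ".join(expanded)
```

The expanded query is then embedded and retrieved exactly as before; only the input to retrieval changes.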
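Post-retrieval processing (item 3) can be sketched as re-ranking plus deduplication. The default `score_fn` here is simple term overlap, standing in for a real cross-encoder score; `score_fn` and both function names are illustrative.

```python
def rerank(query, chunks, score_fn=None):
    # Re-score retrieved chunks against the query and reorder them.
    # score_fn stands in for a cross-encoder; default is term overlap.
    if score_fn is None:
        query_terms = set(query.lower().split())
        score_fn = lambda c: len(query_terms & set(c.lower().split()))
    return sorted(chunks, key=score_fn, reverse=True)

def dedupe(chunks):
    # Drop chunks that are identical after whitespace/case normalization,
    # so duplicates don't waste context window tokens.
    seen, out = set(), []
    for c in chunks:
        key = " ".join(c.lower().split())
        if key not in seen:
            seen.add(key)
            out.append(c)
    return out
```

In practice deduplication would compare embeddings rather than normalized text, so that paraphrased duplicates are also caught.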
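The modular decomposition in item 4 can be expressed as a pipeline of swappable callables. All names here (`RAGPipeline`, `rewrite`, `rerank`, and so on) are illustrative, not a standard API; the point is that each stage can be replaced without touching the others.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RAGPipeline:
    # Each stage is an independently replaceable module.
    rewrite: Callable[[str], str]              # query module
    retrieve: Callable[[str], List[str]]       # retrieval module
    rerank: Callable[[str, List[str]], List[str]]  # re-ranking module
    generate: Callable[[str, List[str]], str]  # generation module

    def answer(self, query: str) -> str:
        q = self.rewrite(query)
        chunks = self.rerank(q, self.retrieve(q))
        return self.generate(query, chunks)
```

Swapping in a better re-ranker is then a one-line change to the pipeline's construction, which is the property that lets identified failure modes be fixed without compromising other components.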

Do This

  • Start with naive RAG and measure its failure modes before implementing advanced patterns
  • Add pre-retrieval processing when query-document terminology mismatch is the primary failure mode
  • Add post-retrieval re-ranking when the most relevant content is consistently not in the top-3 retrieved chunks
  • Graduate to modular RAG when multiple failure modes require independent optimization

Avoid This

  • Implement modular RAG on day one — the complexity is only justified after the simpler patterns have been measured and found insufficient
  • Add more retrieval components without measuring whether they improve the target metric — complexity without measurement is waste
  • Assume the most expensive pattern is the most appropriate — naive RAG with excellent chunking often outperforms advanced RAG with poor chunking