KM-201c · Module 1

RAG Fundamentals: Retrieval-Augmented Generation for Enterprise Knowledge


Retrieval-Augmented Generation (RAG) is the architectural pattern that has made AI-powered knowledge retrieval practical for enterprise use cases. The concept is straightforward: rather than asking a language model to answer questions from its training data alone (which may be outdated, incomplete, or simply not contain your organization's specific knowledge), you first retrieve the relevant sections of your knowledge base, then ask the language model to synthesize an answer from those retrieved sections. The model's job is synthesis, not recall. The retrieval system's job is surfacing the relevant source material.

This separation of concerns is the key insight. Language models are excellent synthesizers — they can take ten relevant passages and produce a coherent, accurate answer in natural language with appropriate context. They are unreliable memorizers — their recall of specific organizational facts is inconsistent, they hallucinate details that sound plausible, and their knowledge cutoff makes them wrong about recent developments. RAG plays to the model's strength and offsets its weakness.

A basic RAG pipeline runs in five steps:

  1. User query. The user submits a natural language question: 'What is our standard SLA for P1 incidents during business hours?' The query may be processed before retrieval: expanded with synonyms, broken into sub-queries if it is complex, or classified to determine which knowledge domains to search.
  2. Semantic retrieval. The query is converted into a vector embedding using the same embedding model that indexed the knowledge base. The vector database returns the k most semantically similar document chunks, ranked by cosine similarity. Metadata filters may narrow the search: only this category, only documents reviewed in the past 12 months, only documents with this audience tag. The retrieval step returns the raw source material, typically 3–10 document chunks.
  3. Context assembly. The retrieved chunks are assembled into a context window for the language model: the chunks themselves, their source metadata (document title, owner, last-updated date), and potentially the query reformulation. The context window has a size limit; if the retrieved chunks exceed it, they are prioritized by relevance score. Assembly is where retrieval quality problems become visible: if the retrieved chunks are not relevant, the model has nothing useful to synthesize from.
  4. Synthesis and response. The language model receives the query and the assembled context and produces a natural language response. The prompt instructs the model to answer only from the provided context, cite the source documents for each claim, indicate when the context does not contain enough information to answer fully, and flag apparent contradictions between source documents. The citation requirement is the critical quality gate: a response that cannot point to a source document for each claim is a hallucination risk.
  5. Attribution display. The response is shown to the user with citations to the source documents. The user can verify any claim against the source, navigate to the full source document for more context, or flag a response as inaccurate to trigger review. Attribution is what distinguishes a trustworthy AI retrieval system from a black box: users trust sources they can verify.
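The pipeline can be sketched end to end in a few dozen lines. This is a toy illustration, not production code: the bag-of-words embed function stands in for a real embedding model, the two-document docs list stands in for a knowledge base and vector database, and the final LLM call is omitted, so the assembled prompt is the output. All names here are invented for the sketch.

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words term counts. A real system would call a
    # trained embedding model; this stand-in keeps the sketch runnable.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, index, k=3):
    # Step 2: rank indexed chunks by similarity to the query embedding.
    q = embed(query)
    ranked = sorted(index, key=lambda c: cosine(q, c["vec"]), reverse=True)
    return ranked[:k]

def assemble_context(chunks):
    # Step 3: include source metadata with each chunk so the model can cite it.
    return "\n\n".join(
        f"[{c['title']} | updated {c['updated']}]\n{c['text']}" for c in chunks
    )

def build_prompt(query, context):
    # Step 4: instruct the model to answer only from context and cite sources.
    return (
        "Answer only from the context below. Cite the source document for "
        "each claim. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

# Step 1 input, plus a two-document stand-in for the knowledge base.
docs = [
    {"title": "Incident SLA Policy", "updated": "2024-11-02",
     "text": "P1 incidents carry a 1-hour response SLA during business hours."},
    {"title": "Vacation Policy", "updated": "2023-06-10",
     "text": "Employees accrue 1.5 vacation days per month."},
]
index = [dict(d, vec=embed(d["text"])) for d in docs]

query = "What is our SLA for P1 incidents during business hours?"
top = retrieve(query, index, k=1)
prompt = build_prompt(query, assemble_context(top))
print(top[0]["title"])  # → Incident SLA Policy
```

Note that the citation metadata travels with each chunk from indexing through context assembly; that is what makes Step 5's attribution display possible without a second lookup.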

Advanced RAG patterns address specific failure modes of basic RAG.

Hybrid retrieval combines keyword search and semantic search: keyword search for exact-match queries (find the document that contains exactly this policy number), semantic search for conceptual queries (find documents about this topic). The combination outperforms either alone on diverse query types.

Multi-hop retrieval chains retrieval steps: the first retrieval finds a document that references another relevant document, and the second retrieval fetches the referenced document. This handles the cross-document synthesis case, where the answer requires combining two related but separately stored pieces of knowledge.

Re-ranking applies a second, more computationally expensive relevance model to the initial retrieval results to reorder them before synthesis. This catches cases where semantic similarity is high but relevance to the specific question is low.
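Hybrid retrieval needs a way to merge the keyword ranking and the semantic ranking into one list. Reciprocal Rank Fusion (RRF) is one common choice for this, shown here as an illustrative option rather than a prescribed one: each ranking contributes a score of 1/(k + rank) per document, so documents that rank well in either list float to the top. The document IDs below are made up.

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: each input ranking (best first) contributes
    # 1 / (k + rank) to a document's fused score; k=60 is a common default
    # that damps the advantage of top ranks.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers.
keyword_hits  = ["policy-042", "sla-handbook", "old-memo"]
semantic_hits = ["sla-handbook", "incident-runbook", "policy-042"]

fused = rrf_fuse([keyword_hits, semantic_hits])
print(fused[0])  # → sla-handbook
```

Here 'sla-handbook' wins because it appears near the top of both lists, while 'policy-042' ranks first in only one. RRF needs no score calibration between the two retrievers, which is why it is a popular fusion step; a learned re-ranker can then reorder the fused list before synthesis.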