KM-301g · Module 2

Re-Ranking: Why First Retrieval Is Rarely Best Retrieval

3 min read

Bi-encoder retrieval — the standard approach in most RAG systems — scores each document independently against the query. It is fast and scales to millions of documents. But it makes a systematic error: it cannot capture the interaction between the query and the document content at the level of specificity that determines true relevance. Cross-encoder re-ranking corrects this by processing the query and candidate document together in a single pass — producing a more accurate relevance score at the cost of higher latency. Re-ranking is the single most impactful post-retrieval improvement in most production RAG systems.

Bi-Encoder vs. Cross-Encoder Bi-encoder: encode query and document separately, compare vectors. Fast (pre-computed document embeddings, single query encoding). Approximate (no direct query-document interaction). Cross-encoder: encode query and document together, output a single relevance score. Slow (cannot pre-compute). Precise (full query-document interaction). The standard pattern: bi-encoder retrieves top-50 candidates, cross-encoder re-ranks to produce the final top-5.
Re-Ranking Implementation Retrieve a larger candidate set than needed (top-50 instead of top-5). Pass each candidate with the original query to the cross-encoder. Re-rank by the cross-encoder score. Use the top-K of the re-ranked set as the final context. The additional latency from cross-encoder re-ranking is typically 200–500ms — acceptable for most knowledge retrieval applications and justified by the retrieval quality improvement.
When Re-Ranking Matters Most Re-ranking shows the greatest improvement when: the knowledge base contains many topically similar documents that bi-encoder scoring cannot distinguish, the queries are specific factual questions where the exact answer exists in one document but not the others, and when the generation model produces inconsistent answers due to context quality variation. If retrieval is already surfacing the correct document consistently, re-ranking adds latency without adding value.

"""
Re-ranking pattern: bi-encoder retrieval + cross-encoder re-rank.
Retrieves top-50 candidates, re-ranks to top-5.
"""

from sentence_transformers import CrossEncoder
from typing import List, Tuple

# Cross-encoder for re-ranking
# Replace with domain-tuned model for best results
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def retrieve_and_rerank(
    query: str,
    vector_store,
    top_k_retrieve: int = 50,
    top_k_final: int = 5,
) -> List[dict]:
    """
    Two-stage retrieval:
    1. Bi-encoder retrieval of top_k_retrieve candidates
    2. Cross-encoder re-ranking to top_k_final
    """
    # Stage 1: Bi-encoder retrieval (fast, approximate)
    candidates = vector_store.similarity_search(
        query=query,
        k=top_k_retrieve,
    )

    if not candidates:
        return []

    # Stage 2: Cross-encoder re-ranking (slow, precise)
    pairs: List[Tuple[str, str]] = [
        (query, doc.page_content) for doc in candidates
    ]
    scores = reranker.predict(pairs)

    # Sort by cross-encoder score descending
    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True,
    )

    # Return top_k_final with scores attached
    results = []
    for doc, score in ranked[:top_k_final]:
        results.append({
            "content": doc.page_content,
            "metadata": doc.metadata,
            "rerank_score": float(score),
        })

    return results