KM-301g · Module 1
Vector Stores & Embedding Models
4 min read
Semantic search works by converting text into numerical vectors — dense representations in a high-dimensional space where similar meanings cluster near each other. When you query a semantic search system, your query is converted to a vector and the system finds the documents whose vectors are most similar to the query vector. The quality of this process depends almost entirely on two factors: how accurately the embedding model captures meaning in vector form, and how well the vector store retrieves the nearest neighbors. Everything downstream — the prompts, the generation, the answer quality — is bounded by what retrieval can find.
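The loop described above can be sketched in a few lines. This is a toy illustration, not a real system: the hand-made 3-dimensional vectors stand in for embedding-model output, and the exhaustive scan stands in for a vector store's index.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; a real model produces hundreds of dimensions.
doc_vectors = {
    "contract termination clause": [0.9, 0.1, 0.0],
    "quarterly revenue report":    [0.1, 0.9, 0.2],
    "office parking policy":       [0.0, 0.2, 0.9],
}

# Stand-in for embedding the query "when can we end the agreement?".
query_vector = [0.8, 0.2, 0.1]

# Rank every document by similarity to the query; a vector store does this
# with an approximate index instead of an exhaustive scan.
ranked = sorted(doc_vectors,
                key=lambda doc: cosine_similarity(query_vector, doc_vectors[doc]),
                reverse=True)
print(ranked[0])  # the nearest neighbor: "contract termination clause"
```

Everything in a production pipeline elaborates on this loop: the embedding model decides where vectors land, and the vector store decides how fast and how accurately the nearest ones are found.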
- Embedding Models: What They Do. An embedding model transforms text into a fixed-length numerical vector that encodes semantic meaning. "The contract was terminated early" and "the agreement ended before its scheduled conclusion" should produce vectors that are geometrically close because they mean the same thing. The quality of this representation — how accurately it preserves semantic relationships — determines retrieval precision. No downstream component compensates for a poor embedding model.
- Embedding Model Selection Criteria. Three dimensions: domain fit (is the model trained on data similar to your corpus?), context window (what is the maximum token length the model can embed as a single unit?), and dimensionality/performance trade-off (higher dimensions capture more information but cost more in storage and query time). For enterprise knowledge bases, domain-specific models consistently outperform general-purpose models. For multilingual corpora, multilingual models are required — general English models do not transfer.
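The storage side of the dimensionality trade-off is simple arithmetic. A rough sizing sketch, assuming float32 vectors and ignoring index structures and metadata (real indexes add meaningful overhead on top of this):

```python
def index_size_gb(num_vectors, dimensions, bytes_per_float=4):
    """Raw vector storage in GB: count x width x bytes per component."""
    return num_vectors * dimensions * bytes_per_float / 1e9

# One million chunks at two common embedding widths:
print(index_size_gb(1_000_000, 384))   # ~1.5 GB
print(index_size_gb(1_000_000, 1536))  # ~6.1 GB, 4x the storage and scan cost
```

Query latency scales with dimensionality as well, so the wider model has to earn its keep with measurably better retrieval precision on your corpus.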
- Vector Stores: Options and Trade-offs. Vector stores index and retrieve embeddings at query time. Common options: Pinecone (managed, high performance, low operational overhead, no hybrid search in base tier), Weaviate (open-source with hybrid search, more operational complexity), pgvector (PostgreSQL extension — no new infrastructure if you are already on PostgreSQL, lower scale ceiling), Chroma (lightweight, good for development and small corpora). Select based on scale requirements, operational capacity, and whether hybrid search (keyword + vector) is required.
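Whichever store you pick, they all share the same core contract: add vectors under IDs, then query by vector for the k nearest neighbors. A minimal in-memory sketch of that contract — the class and method names here are illustrative, not any product's actual API:

```python
import math

class InMemoryVectorStore:
    """Toy vector store: exact cosine-similarity search over a dict.
    Production stores (Pinecone, Weaviate, pgvector, Chroma) replace the
    linear scan below with an approximate-nearest-neighbor index."""

    def __init__(self):
        self._vectors = {}

    def add(self, doc_id, vector):
        self._vectors[doc_id] = vector

    def query(self, vector, k=3):
        """Return the k most similar (doc_id, score) pairs."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm_a = math.sqrt(sum(x * x for x in a))
            norm_b = math.sqrt(sum(y * y for y in b))
            return dot / (norm_a * norm_b)

        scored = [(doc_id, cosine(vector, v))
                  for doc_id, v in self._vectors.items()]
        return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

store = InMemoryVectorStore()
store.add("a", [1.0, 0.0])
store.add("b", [0.0, 1.0])
print(store.query([0.9, 0.1], k=1))  # "a" ranks first
```

The differences between products live behind this interface: index type, hybrid search support, filtering, and operational model — not the interface itself.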
- The Retrieval Quality Ceiling. Every retrieval architecture has a quality ceiling — the maximum retrieval precision achievable given the embedding model and vector store configuration. The ceiling is set by: embedding model semantic accuracy, chunk size relative to query granularity, and the distance metric used for similarity scoring (cosine similarity for most NLP applications, dot product for models trained with it). The ceiling cannot be raised by better prompting or better generation. It can only be raised by improving the retrieval layer.
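The distance-metric point is easy to demonstrate: dot product is sensitive to vector magnitude while cosine is not, so an off-direction vector with a large norm can outrank a semantically closer one. A small illustration with hand-picked vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
close_but_short = [0.5, 0.05]   # nearly the same direction as the query
far_but_long    = [3.0, 3.0]    # 45 degrees off, but large magnitude

# Cosine ranks by direction alone: the short, well-aligned vector wins.
print(cosine(query, close_but_short) > cosine(query, far_but_long))  # True

# Dot product rewards magnitude: the long vector wins, reversing the ranking.
print(dot(query, far_but_long) > dot(query, close_but_short))        # True
```

If embeddings are L2-normalized before indexing, the two metrics produce identical rankings — which is why mixing them only goes wrong silently when normalization is skipped.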
Do This
- Evaluate embedding models on your actual corpus domain before committing — general benchmarks do not predict performance on domain-specific text
- Select vector store based on operational capacity and scale requirements, not just performance benchmarks
- Use cosine similarity as the default distance metric unless your embedding model documentation specifies otherwise
- Benchmark retrieval quality independently from generation quality — they are different problems with different root causes
Avoid This
- Assuming the best-performing general benchmark model is the best model for your specific domain
- Selecting a vector store based on a blog post comparison from two years ago — the landscape has changed significantly
- Using dot product similarity with models trained for cosine similarity — the results will be systematically wrong
- Diagnosing poor answer quality as a generation problem before ruling out retrieval quality as the cause