KM-301g · Module 1
Chunking Strategies
Chunking is the process of splitting source documents into units that are embedded and indexed separately. It is the most underestimated variable in retrieval system design: a bad chunking strategy (chunks too large, too small, or cut at semantically wrong boundaries) will degrade retrieval quality regardless of how good the embedding model is. The rule: a chunk should contain one complete idea, no more and no less. A chunk that contains three ideas confuses the embedding; a chunk that contains half an idea loses the context that makes it useful.
- **Fixed-Size Chunking.** Split documents into chunks of N tokens (typically 256–512), with or without overlap. Simplest to implement, but worst semantic fidelity: the chunk boundary is determined by token count, not by idea boundaries, so it often cuts mid-sentence, mid-paragraph, or mid-argument. Appropriate for highly uniform, structured text where semantic boundaries align with token count; inappropriate for narrative documents, technical documentation, or any content where idea length varies significantly.
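The fixed-size strategy fits in a few lines. This is an illustrative sketch, not a production implementation: it uses whitespace-split words as a stand-in for tokenizer tokens, and the 512/64 defaults are arbitrary choices, not recommendations.

```python
from typing import List

def fixed_size_chunks(text: str, size: int = 512, overlap: int = 64) -> List[str]:
    """Split text into chunks of `size` tokens, each sharing `overlap`
    tokens with its predecessor. Whitespace words stand in for real
    tokenizer tokens here; swap in an actual tokenizer in production."""
    tokens = text.split()
    step = size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        piece = tokens[i:i + size]
        # Drop a trailing fragment that is nothing but overlap.
        if i > 0 and len(piece) <= overlap:
            break
        chunks.append(" ".join(piece))
    return chunks
```

Note that the cut points fall wherever the count happens to land, which is exactly the mid-sentence problem described above.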
- **Semantic Chunking.** Split at semantic boundaries: paragraph breaks, section headers, logical transitions. Produces chunks that contain complete ideas, but is more complex to implement, requiring heuristic rules or a trained classifier to identify the boundaries. Best practice: use heading and paragraph structure as the primary boundaries, and set minimum and maximum chunk-size constraints to prevent empty chunks and chunks too large to embed accurately.
- **Hierarchical Chunking.** Index documents at multiple granularities simultaneously: document level, section level, paragraph level. Each query can then retrieve at the most appropriate granularity: a question about a specific fact retrieves a paragraph, a question about an overall approach retrieves a section, and a question about an organization's policy retrieves a document summary. Hierarchical chunking requires more infrastructure but produces significantly better recall across queries of varying scope.
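A minimal sketch of building such a multi-granularity index, assuming documents arrive pre-parsed into sections of paragraphs. The record shape, the `level` field, and the `doc_id#section/i` id scheme are all illustrative assumptions, not a standard.

```python
from typing import Dict, List

def hierarchical_index(doc_id: str, sections: Dict[str, List[str]]) -> List[dict]:
    """Emit one index record per granularity (document, section,
    paragraph) so retrieval can match the scope of the query."""
    records = []
    all_text = " ".join(p for paras in sections.values() for p in paras)
    records.append({"id": doc_id, "level": "document", "text": all_text})
    for title, paras in sections.items():
        records.append({"id": f"{doc_id}#{title}", "level": "section",
                        "text": " ".join(paras)})
        for i, para in enumerate(paras):
            records.append({"id": f"{doc_id}#{title}/{i}", "level": "paragraph",
                            "text": para})
    return records
```

Each record would then be embedded separately; at query time, a fact lookup can filter to `level == "paragraph"` while a policy question filters to `level == "document"`.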
- **Overlap and Context Preservation.** Regardless of chunking strategy, include 10–20% overlap between adjacent chunks. Overlap preserves the context that a sentence at a chunk boundary depends on: without it, a sentence whose meaning requires the prior sentence is retrieved alone and its embedding is degraded. Overlap is a simple addition that consistently improves retrieval quality at a moderate storage cost.
"""
Semantic chunking with overlap — production pattern.
Splits on section/paragraph boundaries, applies overlap,
enforces min/max token constraints.
"""
from typing import List
import tiktoken
ENCODING = tiktoken.get_encoding("cl100k_base")
MIN_TOKENS = 50
MAX_TOKENS = 512
OVERLAP_TOKENS = 64
def count_tokens(text: str) -> int:
return len(ENCODING.encode(text))
def chunk_document(text: str) -> List[str]:
"""
Chunk a document at semantic boundaries (paragraphs),
enforce min/max token constraints, and add overlap
between adjacent chunks.
"""
# Split at semantic boundaries
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
chunks = []
current_chunk = []
current_tokens = 0
for para in paragraphs:
para_tokens = count_tokens(para)
# Para alone exceeds max: force-split it
if para_tokens > MAX_TOKENS:
if current_chunk:
chunks.append(" ".join(current_chunk))
current_chunk = []
current_tokens = 0
# Fixed-size fallback for oversized paragraphs
tokens = ENCODING.encode(para)
for i in range(0, len(tokens), MAX_TOKENS - OVERLAP_TOKENS):
chunk_tokens = tokens[i:i + MAX_TOKENS]
chunks.append(ENCODING.decode(chunk_tokens))
continue
# Adding para would exceed max: flush current chunk
if current_tokens + para_tokens > MAX_TOKENS and current_chunk:
chunk_text = " ".join(current_chunk)
if count_tokens(chunk_text) >= MIN_TOKENS:
chunks.append(chunk_text)
# Start new chunk with overlap from end of previous
overlap_text = _get_overlap(current_chunk)
current_chunk = [overlap_text, para] if overlap_text else [para]
current_tokens = count_tokens(" ".join(current_chunk))
else:
current_chunk.append(para)
current_tokens += para_tokens
if current_chunk:
chunk_text = " ".join(current_chunk)
if count_tokens(chunk_text) >= MIN_TOKENS:
chunks.append(chunk_text)
return chunks
def _get_overlap(chunk_parts: List[str]) -> str:
"""Return last N tokens of current chunk as overlap context."""
full_text = " ".join(chunk_parts)
tokens = ENCODING.encode(full_text)
overlap = tokens[-OVERLAP_TOKENS:]
return ENCODING.decode(overlap) if len(overlap) > 10 else ""