KM-301g · Module 1

Chunking Strategies


Chunking is the process of splitting source documents into units that are embedded and indexed separately. It is the most underestimated variable in retrieval system design. A bad chunking strategy — chunks too large, too small, or cut at semantically wrong boundaries — will degrade retrieval quality regardless of how good the embedding model is. The rule: a chunk should contain one complete idea, no more, no less. A chunk that contains three ideas yields an averaged embedding that represents none of them well. A chunk that contains half an idea loses the context that makes it useful.

  1. Fixed-Size Chunking. Split documents into chunks of N tokens (typically 256–512), with or without overlap. Simplest to implement. Worst semantic fidelity — the chunk boundary is determined by token count, not by idea boundaries. Often cuts mid-sentence, mid-paragraph, or mid-argument. Appropriate for: highly uniform, structured text where semantic boundaries align with token count. Inappropriate for: narrative documents, technical documentation, or any content where idea length varies significantly.
  2. Semantic Chunking. Split at semantic boundaries: paragraph breaks, section headers, logical transitions. Produces chunks that contain complete ideas. More complex to implement — requires heuristic rules or a trained classifier to identify semantic boundaries. Best practice: use heading/paragraph structure as primary boundaries, and set minimum and maximum chunk size constraints to prevent empty chunks and chunks that are too large to embed accurately.
  3. Hierarchical Chunking. Index documents at multiple granularities simultaneously: document level, section level, paragraph level. Each query can retrieve at the most appropriate granularity. A question about a specific fact retrieves a paragraph. A question about an overall approach retrieves a section. A question about an organization's policy retrieves a document summary. Hierarchical chunking requires more infrastructure but produces significantly better recall on queries of varying scope.
  4. Overlap and Context Preservation. Regardless of chunking strategy, include 10–20% overlap between adjacent chunks. Overlap preserves the context that a sentence at the boundary of a chunk depends on. Without overlap, a sentence whose meaning requires the prior sentence is retrieved without its context and the embedding is degraded. Overlap is a simple addition that consistently improves retrieval quality at moderate storage cost.
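Strategy 1 is simple enough to show in a few lines. The sketch below splits on words rather than tokens to stay dependency-free; a production version would count tokens with the embedding model's tokenizer (the function name and defaults here are illustrative):

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 256, overlap: int = 32) -> List[str]:
    """Split text into fixed-size word windows with a fixed overlap.

    Word-based for illustration; production code would count tokens
    with the embedding model's tokenizer (e.g. tiktoken).
    """
    words = text.split()
    stride = chunk_size - overlap  # step so adjacent windows share `overlap` words
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), stride)
    ]
```

Note that adjacent windows share `overlap` words, which is exactly the context-preservation mechanism described in item 4 — but the boundaries still fall wherever the token count dictates, not at idea boundaries.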
"""
Semantic chunking with overlap — production pattern.
Splits on section/paragraph boundaries, applies overlap,
enforces min/max token constraints.
"""

from typing import List
import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")
MIN_TOKENS = 50
MAX_TOKENS = 512
OVERLAP_TOKENS = 64


def count_tokens(text: str) -> int:
    return len(ENCODING.encode(text))


def chunk_document(text: str) -> List[str]:
    """
    Chunk a document at semantic boundaries (paragraphs),
    enforce min/max token constraints, and add overlap
    between adjacent chunks.
    """
    # Split at semantic boundaries
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks = []
    current_chunk = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = count_tokens(para)

        # Para alone exceeds max: flush current chunk, then force-split it
        if para_tokens > MAX_TOKENS:
            if current_chunk:
                chunk_text = " ".join(current_chunk)
                if count_tokens(chunk_text) >= MIN_TOKENS:
                    chunks.append(chunk_text)
                current_chunk = []
                current_tokens = 0
            # Fixed-size fallback: step by MAX_TOKENS - OVERLAP_TOKENS
            # so adjacent windows overlap by OVERLAP_TOKENS
            tokens = ENCODING.encode(para)
            for i in range(0, len(tokens), MAX_TOKENS - OVERLAP_TOKENS):
                chunk_tokens = tokens[i:i + MAX_TOKENS]
                chunks.append(ENCODING.decode(chunk_tokens))
            continue

        # Adding para would exceed max: flush current chunk
        if current_tokens + para_tokens > MAX_TOKENS and current_chunk:
            chunk_text = " ".join(current_chunk)
            if count_tokens(chunk_text) >= MIN_TOKENS:
                chunks.append(chunk_text)
            # Start new chunk with overlap from end of previous
            overlap_text = _get_overlap(current_chunk)
            current_chunk = [overlap_text, para] if overlap_text else [para]
            current_tokens = count_tokens(" ".join(current_chunk))
        else:
            current_chunk.append(para)
            current_tokens += para_tokens

    if current_chunk:
        chunk_text = " ".join(current_chunk)
        if count_tokens(chunk_text) >= MIN_TOKENS:
            chunks.append(chunk_text)

    return chunks


def _get_overlap(chunk_parts: List[str]) -> str:
    """Return the last OVERLAP_TOKENS tokens of the chunk as overlap context."""
    full_text = " ".join(chunk_parts)
    tokens = ENCODING.encode(full_text)
    overlap = tokens[-OVERLAP_TOKENS:]
    # For very short chunks, a few tokens of overlap add noise, not context
    return ENCODING.decode(overlap) if len(overlap) > 10 else ""
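Strategy 3 (hierarchical chunking) can be layered on top of any paragraph-level chunker by indexing the same document at several granularities. A minimal sketch, assuming sections are marked with `# ` headings; `IndexEntry` and `hierarchical_entries` are illustrative names, not part of the code above:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexEntry:
    """One indexable unit at a given granularity."""
    level: str   # "document" | "section" | "paragraph"
    doc_id: str
    text: str


def hierarchical_entries(doc_id: str, text: str) -> List[IndexEntry]:
    """Index one document at document, section, and paragraph granularity.

    Sections are assumed to be delimited by "# " headings on their own line;
    real systems would parse actual document structure.
    """
    entries = [IndexEntry("document", doc_id, text)]
    sections = [s.strip() for s in text.split("\n# ") if s.strip()]
    for section in sections:
        entries.append(IndexEntry("section", doc_id, section))
        for para in section.split("\n\n"):
            if para.strip():
                entries.append(IndexEntry("paragraph", doc_id, para.strip()))
    return entries
```

Each entry would then be embedded and stored alongside its `level`, so retrieval can filter or re-rank by granularity to match the scope of the query.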