Chunking Strategies for RAG

Chunking is how you split documents into pieces before embedding and indexing. Too large and retrieval may pull in irrelevant context; too small and you lose coherence. Good chunking balances size, overlap, and semantic boundaries so that the right chunks are retrieved and the LLM gets usable context.

Figure: a document split into chunks for retrieval.

Strategies

Fixed-size chunking (e.g. 512 tokens with a 50-token overlap) is simple and often a solid baseline. Splitting on sentence or paragraph boundaries helps keep each chunk self-contained. Some systems use semantic chunking, which splits where the topic or meaning shifts. Overlap between adjacent chunks reduces the risk of cutting important context at a boundary.

python

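def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into chunks of chunk_size tokens; neighbours share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Assumption: whitespace tokens stand in for a real tokenizer.
    # Swap in the tokenizer of your embedding model for accurate token counts.
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[i:i + chunk_size]))
        if i + chunk_size >= len(tokens):
            break  # this window reached the end; another would only repeat the overlap
    return chunks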
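
Sentence-boundary chunking can be sketched in a similar way. The version below is a minimal illustration, not a production splitter: the function name `chunk_by_sentences` is hypothetical, sentences are detected with a naive regex, and whitespace-separated words stand in for tokens.

```python
import re


def chunk_by_sentences(text: str, max_tokens: int = 512) -> list[str]:
    """Pack whole sentences into chunks of at most max_tokens whitespace tokens."""
    # Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_tokens` ends up as its own oversized chunk here; a real pipeline would probably fall back to fixed-size splitting for that case.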

Testing matters

Try different chunk sizes and overlaps on your own documents. Measure retrieval accuracy on real questions, for example how often the chunk containing the answer appears in the top results. Long policies or manuals may need larger chunks or hierarchical retrieval so the model can zoom in on the right section.
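
One way to run such a comparison is a small hit-rate check. The sketch below rests on several assumptions: the retriever is a toy that ranks chunks by shared words rather than embeddings, a "hit" means the expected answer string appears in one of the top-k chunks, and all names (`chunk_words`, `retrieve`, `hit_rate`, `DOC`, `QA_PAIRS`) and the sample data are hypothetical.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size chunking over whitespace tokens (stand-in for a real tokenizer)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    """Toy retriever: rank chunks by the number of lowercase words shared with the query."""
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]


def hit_rate(doc: str, qa_pairs: list[tuple[str, str]],
             size: int, overlap: int, k: int = 3) -> float:
    """Fraction of questions whose expected answer string appears in a top-k chunk."""
    chunks = chunk_words(doc, size, overlap)
    hits = sum(
        any(answer.lower() in c.lower() for c in retrieve(question, chunks, k))
        for question, answer in qa_pairs
    )
    return hits / len(qa_pairs)


# Hypothetical sample data; replace with your own documents and questions.
DOC = ("Refunds are issued within 30 days. Shipping takes 5 business days. "
       "Support is open on weekdays.")
QA_PAIRS = [
    ("How long do refunds take?", "30 days"),
    ("How long does shipping take?", "5 business days"),
]

for size, overlap in [(6, 0), (8, 2), (100, 0)]:
    print(size, overlap, hit_rate(DOC, QA_PAIRS, size, overlap, k=1))
```

With a real system you would swap `retrieve` for your vector search and keep the loop over chunk sizes and overlaps unchanged.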

text

# Typical ranges (tokens)
# Short chunks:  128-256  (precise retrieval, may lose context)
# Medium:        512      (common default, balance)
# Long:          1024+    (more context, more noise)
# Overlap:       10-20%   (reduces boundary cuts)
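
To see what these ranges mean in practice, the arithmetic can be sketched as below. The helper names are hypothetical, and the chunk count assumes each chunk starts `chunk_size - overlap` tokens after the previous one and that chunking stops once a window reaches the end of the document.

```python
import math


def overlap_tokens(chunk_size: int, overlap_pct: float) -> int:
    """Convert an overlap percentage into a token count, rounded to the nearest token."""
    return round(chunk_size * overlap_pct / 100)


def estimated_chunks(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of chunks produced for a document of total_tokens tokens."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    # Each chunk after the first contributes (chunk_size - overlap) new tokens.
    return math.ceil((total_tokens - overlap) / (chunk_size - overlap))


overlap = overlap_tokens(512, 10)                # 10% of a 512-token chunk
print(overlap, estimated_chunks(10_000, 512, overlap))
```

Raising the overlap shrinks the stride, so the same document yields more chunks to embed and store.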
