Abstract
A RAG system can have a strong embedding model, a fast vector index, and a reasonable prompt, and still miss the sentence that answers the user's question. The answer often vanishes before nearest-neighbor search starts, when the document is cut into chunks that are too small, too large, too noisy, or stripped of the metadata that made the text retrievable.
RAG chunking strategy sits before vector search. It decides what the embedding model sees, what the index stores, what the retriever can rank, and what the generator receives as evidence. A 200-token chunk can make retrieval precise and brittle. A 1,200-token chunk can preserve context and dilute the matching signal. Overlap can rescue boundary cases and also inflate the index with near-duplicates.
For the index mechanics, use the earlier HNSW and IVF-PQ vector search guide. For diagnosing downstream answer quality, use the hallucination and retrieval coverage guide. The missing layer is the document representation question: what should become a vector in the first place?
1. Chunking Changes the Object Being Retrieved
Vector search ranks chunks. Documents only matter after the ingestion pipeline turns them into retrievable units. That sounds obvious until a retrieval trace returns the right file and the wrong evidence.
Take a support policy page with this structure:

    Returns
    Customers can return unopened hardware within 30 days.

    Exceptions
    Custom engraved hardware is final sale.
    Enterprise contracts may override the standard return window.

A fixed splitter might produce:

    Chunk A:
    Returns
    Customers can return unopened hardware within 30 days.

    Chunk B:
    Exceptions
    Custom engraved hardware is final sale.
    Enterprise contracts may override the standard return window.

For the query "Can an enterprise customer return engraved hardware after 30 days?", neither chunk is enough. Chunk A has the standard return window. Chunk B has exceptions. The answer depends on both. If the retriever only returns Chunk A, the model may answer with the standard rule. If it only returns Chunk B, it may answer without the baseline rule.
Now change the chunking:

    Chunk A:
    Returns
    Customers can return unopened hardware within 30 days.
    Exceptions
    Custom engraved hardware is final sale.
    Enterprise contracts may override the standard return window.

The larger chunk improves answerability for that question. It also adds more unrelated words to the embedding. A broad chunk can match many queries weakly, crowding out a smaller chunk that contains a more exact answer.
That is the central trade-off:
| Chunk shape | Retrieval benefit | Retrieval cost |
|---|---|---|
| Small chunks | Sharp semantic match, lower prompt cost, easier citation. | Missing surrounding conditions, tables, definitions, or exceptions. |
| Large chunks | More local context and fewer split thoughts. | Lower embedding specificity, more prompt tokens, noisier ranking. |
| Overlapped chunks | Fewer boundary misses. | More storage, more duplicate hits, harder deduplication. |
The goal is to make the retrievable unit match the evidence unit. A paragraph may be enough for a definition. A whole section may be needed for a policy. A table row may need its header. A code function may need imports, type definitions, and the comment that explains the invariant.
2. Chunk Size Is a Recall-Precision Dial
Chunk size controls the amount of text embedded as one vector. In most systems, it is the first dial people change because frameworks expose it directly.
LlamaIndex's token text splitter exposes chunk_size and chunk_overlap, with token-aware splitting and backup separators. LangChain's text splitter docs show RecursiveCharacterTextSplitter with the same core controls, including chunk_size and chunk_overlap, while preserving larger text units where possible. OpenAI's retrieval docs show static chunking controls for vector store files, including max_chunk_size_tokens and chunk_overlap_tokens; the current default is 800-token chunks with 400-token overlap.
Those defaults are starting points. A handbook, a codebase, financial filings, call transcripts, research PDFs, and product documentation all deserve separate measurement.
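A useful first-order model for the number of chunks $N$:
$$N \approx \left\lceil \frac{T - O}{S} \right\rceil, \qquad S = C - O$$
where:
- $T$ is document length in tokens.
- $C$ is chunk size.
- $O$ is overlap.
- $S$ is stride.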
For a 20,000-token manual:

    import math

    def chunk_count(tokens: int, chunk_size: int, overlap: int) -> int:
        """Estimate how many overlapping chunks a document produces."""
        stride = chunk_size - overlap  # tokens advanced per chunk
        return math.ceil((tokens - overlap) / stride)

    for size, overlap in [(400, 80), (800, 200), (800, 400), (1200, 200)]:
        print(size, overlap, chunk_count(20_000, size, overlap))

Expected output:

    400 80 63
    800 200 33
    800 400 49
    1200 200 20

The 800-token chunk with 400-token overlap creates almost 50% more chunks than the same size with 200-token overlap (49 versus 33). That affects storage, embedding spend, indexing time, and the number of near-duplicate candidates the retriever must rank.
The retrieval quality curve is usually non-monotonic:
| Moving chunk size up | What improves | What can regress |
|---|---|---|
| 200 -> 400 tokens | More complete sentences and short paragraphs. | More mixed topics in dense pages. |
| 400 -> 800 tokens | Better local context for policies, tutorials, and reports. | Less precise matching for exact facts. |
| 800 -> 1,200+ tokens | Better section-level coherence. | Higher prompt cost and more diluted embeddings. |
Small chunks favor precision. Large chunks favor context. The right size depends on query distribution.
For "What is the refund window?", a short chunk with the exact sentence is enough. For "Can enterprise customers return engraved hardware?", a larger chunk or parent-section expansion is safer. For "Which API parameter controls vector store chunk overlap?", a precise chunk with code and parameter names will beat a broad page-level chunk.
3. Overlap Fixes Boundaries and Creates Duplicate Evidence
Overlap exists because real documents ignore artificial chunk boundaries. A definition starts at the end of one chunk. A caveat begins in the next. A table header sits above the row. A function call is separated from the type that explains it.
Overlap carries some trailing context forward:

    Chunk 1, no overlap:
    ... Enterprise contracts may override the standard return window.

    Chunk 2, no overlap:
    Custom engraved hardware is final sale.

    Chunk 1, with overlap:
    ... Customers can return unopened hardware within 30 days.
    Enterprise contracts may override the standard return window.

    Chunk 2, with overlap:
    Enterprise contracts may override the standard return window.
    Custom engraved hardware is final sale.

The overlapped version makes each boundary chunk more answerable. It also means the same sentence can appear in several vectors. When top-k retrieval returns three variants of the same local passage, the model receives less diverse evidence than the retrieval count suggests.
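As a rough illustration of how overlap produces those shared sentences, the sketch below slides a fixed window over a token list. It assumes whitespace-separated words as stand-in tokens, and `sliding_window_chunks` is a hypothetical helper written for this example, not a function from LlamaIndex, LangChain, or the OpenAI API.

```python
def sliding_window_chunks(tokens, chunk_size, overlap):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not tokens:
        return []
    stride = chunk_size - overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += stride
    return chunks


# Whitespace "tokens" stand in for a real tokenizer here.
tokens = "a b c d e f g h i j".split()
for chunk in sliding_window_chunks(tokens, chunk_size=4, overlap=2):
    print(chunk)
```

With `chunk_size=4` and `overlap=2`, every chunk after the first begins with the last two tokens of the previous chunk. That is the same stride arithmetic `chunk_count` uses, and it is why each boundary sentence ends up embedded twice.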
A practical overlap policy:
| Corpus shape | Starting overlap | Why |
|---|---|---|
| Short FAQ entries | 0-10% | Entries are already atomic. |
| Product docs and tutorials | 10-20% | Steps and caveats often cross paragraph boundaries. |
| Legal, policy, and compliance text | 15-30% | Conditions and exceptions need continuity. |
| Code | Function-aware first, then 5-15% | Syntax boundaries matter more than raw overlap. |
| Tables | Preserve headers with rows | Token overlap alone can split semantics. |
Treat overlap as a boundary-recall tool and measure it as one. If increasing overlap raises hit rate while the top results become mostly duplicates, the retriever may need deduplication, parent-child retrieval, or max marginal relevance (MMR).
A simple duplicate check:

    def duplicate_ratio(results):
        """Fraction of results that repeat a parent already present in the list."""
        if not results:
            return 0.0
        parent_ids = [r["parent_id"] for r in results]
        return 1 - (len(set(parent_ids)) / len(parent_ids))

    results = [
        {"chunk_id": "policy-12-a", "parent_id": "returns"},
        {"chunk_id": "policy-12-b", "parent_id": "returns"},
        {"chunk_id": "policy-13-a", "parent_id": "exceptions"},
        {"chunk_id": "shipping-02-a", "parent_id": "shipping"},
    ]
    print(duplicate_ratio(results))

Expected output:

    0.25

That number is a smoke test rather than a universal metric. If the duplicate ratio climbs as overlap increases, the recall gain may be coming from repeated evidence rather than broader coverage.
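When the duplicate ratio stays high, max marginal relevance (MMR) is one of the options mentioned above. The following is a minimal pure-Python sketch, assuming chunk vectors are already available as lists of floats. The function names, the `lambda_` weight, and the toy two-dimensional vectors are illustrative assumptions, not any particular library's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, candidates, k, lambda_=0.7):
    """Pick k chunk ids, trading query relevance against redundancy.

    candidates: list of (chunk_id, vector) pairs.
    lambda_: 1.0 means pure relevance, lower values penalize near-duplicates more.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Similarity to the closest chunk that was already selected.
            redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
            return lambda_ * relevance - (1 - lambda_) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [chunk_id for chunk_id, _ in selected]


# Toy vectors: the two policy-12 chunks are near-duplicates of each other.
candidates = [
    ("policy-12-a", [1.0, 0.10]),
    ("policy-12-b", [1.0, 0.12]),
    ("shipping-02-a", [0.7, -0.7]),
]
print(mmr([1.0, 0.0], candidates, k=2, lambda_=0.5))
```

On these toy vectors the second pick is the less similar shipping chunk rather than the overlapping sibling, which a plain top-2 by relevance would have returned. In practice `lambda_` needs tuning against the evaluation set, since too much diversity pressure can push out evidence that is required.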
4. Semantic Boundaries Beat Blind Token Windows
A fixed 800-token window is easy to operate. It also cuts through document structure.
The better default is a hierarchy:

    document
      section
        subsection
          paragraph
            sentence
              token fallback

Split at the highest boundary that keeps the chunk under the size budget. Then use token splitting only when the natural unit is too large.
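One way to read that rule is as a recursive split: pack units at the coarsest boundary, and descend to a finer one only when a single unit exceeds the budget. The sketch below assumes whitespace word counts as a stand-in for tokens and uses only three levels (paragraph, sentence, word). `split_hierarchical` and `SEPARATORS` are hypothetical names, and the separators are consumed by the split, so sentence punctuation is only approximately preserved.

```python
SEPARATORS = ["\n\n", ". ", " "]  # paragraph, sentence, word fallback


def n_tokens(text: str) -> int:
    # Whitespace word count stands in for a real tokenizer.
    return len(text.split())


def split_hierarchical(text: str, budget: int, separators=SEPARATORS) -> list[str]:
    """Split at the coarsest separator that keeps every chunk within budget."""
    if n_tokens(text) <= budget:
        return [text] if text.strip() else []
    if not separators:
        # No structure left: fall back to a hard window over words.
        words = text.split()
        return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part.strip():
            continue
        candidate = current + sep + part if current else part
        if n_tokens(candidate) <= budget:
            current = candidate  # keep packing units at this level
            continue
        if current:
            chunks.append(current)
            current = ""
        if n_tokens(part) <= budget:
            current = part
        else:
            # This unit alone is too large: descend to a finer boundary.
            chunks.extend(split_hierarchical(part, budget, finer))
    if current:
        chunks.append(current)
    return chunks


text = "\n\n".join([
    "Customers can return unopened hardware within 30 days.",
    "Custom engraved hardware is final sale.",
    "Enterprise contracts may override the standard return window. "
    "Contract terms are listed in the order form. "
    "Ask the account team for a copy.",
])
for chunk in split_hierarchical(text, budget=16):
    print(n_tokens(chunk), repr(chunk))
```

With a 16-word budget, the two short paragraphs are packed into one chunk, and only the long third paragraph is broken at sentence boundaries. A production splitter would also carry headings and offsets along with each chunk.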
For documentation:

    H1: Vector stores
      H2: File upload
        H3: Chunking strategy
          Paragraphs and code examples

The chunk should carry headings as metadata or prefix text:

    title: Vector stores
    section: File upload
    subsection: Chunking strategy

    You can customize chunking with max_chunk_size_tokens and chunk_overlap_tokens...

That heading context matters because many local paragraphs are written with assumed context. A paragraph that starts with "You can customize this..." may embed poorly when detached from the section that defines "this."
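A small sketch of the prefix-text variant, assuming the ingestion pipeline already knows each chunk's heading path. `embedding_text` is a hypothetical helper, and the three labels simply mirror the example above:

```python
def embedding_text(chunk_text, heading_path):
    """Prefix a chunk with its heading path so the embedded text carries its context."""
    labels = ["title", "section", "subsection"]
    header_lines = [f"{label}: {heading}" for label, heading in zip(labels, heading_path)]
    # Blank line separates the heading prefix from the body text.
    return "\n".join(header_lines + ["", chunk_text])


print(embedding_text(
    "You can customize this with max_chunk_size_tokens and chunk_overlap_tokens.",
    ["Vector stores", "File upload", "Chunking strategy"],
))
```

Whether the headings belong in the embedded text, in filterable metadata, or in both is worth testing per corpus. Prefix text changes the vector, while metadata only changes what can be filtered or displayed.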
For code, syntax-aware chunking usually beats paragraph splitting:
| Artifact | Better chunk unit | Metadata to keep |
|---|---|---|
| Python module | Function or class, with imports when needed. | File path, symbol name, class, dependency imports. |
| TypeScript component | Component plus props/types. | Route, export name, related hooks. |
| API docs | Endpoint section. | Method, path, auth scope, version. |
| SQL | Query or migration block. | Table names, migration id, schema. |
For tables, the unit is often a row plus headers, or a small group of rows plus headers. A chunk containing "$0.10/GB/day" without the storage tier, product, and pricing context is too weak to support a grounded answer.
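A minimal sketch of row-group chunking, assuming the table has already been parsed into a header list and row lists. The helper name is hypothetical and the pricing rows are made-up illustration data, apart from the "$0.10/GB/day" value quoted above:

```python
def table_row_chunks(title, headers, rows, rows_per_chunk=2):
    """Emit one chunk per group of rows, repeating the title and header line each time."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        lines = [f"table: {title}", " | ".join(headers)]
        lines += [" | ".join(row) for row in group]
        chunks.append("\n".join(lines))
    return chunks


# Hypothetical pricing table used only for illustration.
headers = ["Product", "Storage tier", "Price"]
rows = [
    ["Vector store", "Standard", "$0.10/GB/day"],
    ["Vector store", "Archive", "$0.02/GB/day"],
    ["File storage", "Standard", "$0.05/GB/day"],
]
for chunk in table_row_chunks("Storage pricing", headers, rows):
    print(chunk, end="\n\n")
```

Each chunk repeats the table title and header row, so a retrieved price always arrives with the product and tier that give it meaning.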
The same rule applies to PDFs. Page boundaries are implementation details. If the question is answered by a caption, a footnote, or a row that continues across pages, the chunker has to preserve that relationship explicitly.
5. Retrieval Quality Needs Chunk-Level Metrics
A chunking change should be evaluated before it reaches production. The basic question is:

    For the queries users actually ask, does the retriever return chunks
    that contain enough evidence to answer correctly?

Start with a small evaluation set:
| Field | Example |
|---|---|
| Query | "Can enterprise customers return engraved hardware?" |
| Expected source ids | returns-policy#exceptions, returns-policy#enterprise-contracts |
| Answerable from one chunk? | No |
| Required evidence | Standard return window, engraved exception, enterprise override. |
| Negative distractors | General returns FAQ, warranty policy, shipping exceptions. |
Then compare chunking configurations:

    400 tokens / 80 overlap
    800 tokens / 200 overlap
    800 tokens / 400 overlap
    section-aware / 15% overlap
    parent-child retrieval

Use retrieval metrics before answer metrics:
| Metric | What it tells you |
|---|---|
| Hit rate@k | Did any expected chunk appear in the top k? |
| Recall@k | How much required evidence was retrieved? |
| MRR | How early did the first useful chunk appear? |
| Precision@k | How much of the returned context was useful? |
| Context recall | Were the answer-supporting claims covered by retrieved context? |
| Duplicate ratio | Did top-k collapse into repeated overlapping chunks? |
LlamaIndex's retrieval evaluator exposes standard ranking metrics such as hit rate, MRR, precision, recall, average precision, and NDCG. Ragas separates context recall into claim support over retrieved context, and also supports non-LLM or ID-based context recall when reference contexts are available.
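For a first pass, the ID-based metrics in the table can be computed by hand before adopting an evaluation framework. The functions below are a hand-rolled sketch, not the LlamaIndex or Ragas implementations. They assume each query has a set of expected source ids and a ranked list of retrieved ids. The two expected ids come from the evaluation example above; the other ids are made up.

```python
def hit_rate_at_k(retrieved, expected, k):
    """1.0 if any expected id appears in the top k, else 0.0."""
    return 1.0 if any(cid in expected for cid in retrieved[:k]) else 0.0


def recall_at_k(retrieved, expected, k):
    """Share of expected ids that appear in the top k."""
    return len(set(retrieved[:k]) & expected) / len(expected)


def precision_at_k(retrieved, expected, k):
    """Share of the top k results that are expected ids."""
    top = retrieved[:k]
    return sum(cid in expected for cid in top) / len(top) if top else 0.0


def reciprocal_rank(retrieved, expected):
    """1 / rank of the first expected id; MRR is the mean of this over all queries."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in expected:
            return 1.0 / rank
    return 0.0


expected = {"returns-policy#exceptions", "returns-policy#enterprise-contracts"}
retrieved = [
    "returns-faq#general",
    "returns-policy#exceptions",
    "warranty-policy#coverage",
    "shipping-policy#exceptions",
    "returns-policy#standard",
]
print("hit_rate@5 ", hit_rate_at_k(retrieved, expected, 5))
print("recall@5   ", recall_at_k(retrieved, expected, 5))
print("precision@5", precision_at_k(retrieved, expected, 5))
print("rr         ", reciprocal_rank(retrieved, expected))
```

Here hit rate is 1.0 while recall is only 0.5, because the enterprise-contract chunk is missing. A hit-rate-only dashboard would hide that gap on exactly the multi-evidence questions that chunking affects most.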
The key comparison is the metric profile:

    small chunks:
      precision@5 improves
      recall@5 drops on multi-hop or exception-heavy questions
    large chunks:
      recall@5 improves
      precision@5 drops
      answer latency rises from larger context
    high overlap:
      hit_rate@5 improves
      duplicate_ratio rises
      diverse evidence may drop

The answer model should be evaluated after the retriever passes this first gate. If the retriever misses the required source, answer grading mostly measures the model's willingness to guess.
6. Parent-Child Retrieval Separates Search From Context
One common fix is to search over small chunks and answer with larger parent chunks.

    child chunk:
      250-token paragraph used for embedding and nearest-neighbor search
    parent chunk:
      surrounding section passed to the generator after a child hit

This gives the retriever a sharp vector while giving the generator enough context. It is especially useful for:
- API documentation where a parameter paragraph needs the endpoint section.
- Legal text where a clause needs neighboring exceptions.
- Tutorials where one step depends on the previous step.
- Tables where a row needs headers and notes.
- Code where a function needs class-level state or imports.
The trade-off is context expansion. If every hit expands to a large parent, top-k can become expensive quickly:
$$\text{context tokens} \approx k \times P$$
where $P$ is parent size. With $k = 8$ and 1,200-token parents, retrieval can inject 9,600 tokens before counting the user question, instructions, citations, or conversation history.
A safer policy:

    retrieve 20 child chunks
    deduplicate by parent id
    rerank child chunks
    expand top 4 unique parents
    trim parent text around matched child when possible

That policy keeps high recall during search, preserves enough context for generation, and avoids filling the prompt with repeated sibling chunks from the same section.
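The core of that policy can be sketched in a few lines. This version assumes child hits arrive as dicts with `chunk_id`, `parent_id`, and `score`, and that parent text is available in a dict keyed by parent id. Sorting by `score` stands in for a real reranker, trimming is left out, and all names and sample values are hypothetical.

```python
def expand_to_parents(child_hits, parents, max_parents=4):
    """Keep the best child per parent, then expand the top unique parents."""
    # Sorting by score is a placeholder for a proper reranking step.
    ranked = sorted(child_hits, key=lambda hit: hit["score"], reverse=True)
    seen, context = set(), []
    for hit in ranked:
        parent_id = hit["parent_id"]
        if parent_id in seen:
            continue  # a sibling chunk already pulled in this parent
        seen.add(parent_id)
        context.append({
            "parent_id": parent_id,
            "matched_child": hit["chunk_id"],  # keep the exact span for citations
            "text": parents[parent_id],
        })
        if len(context) == max_parents:
            break
    return context


# Made-up hits and parent sections for illustration.
hits = [
    {"chunk_id": "returns#p2", "parent_id": "returns", "score": 0.88},
    {"chunk_id": "returns#p1", "parent_id": "returns", "score": 0.91},
    {"chunk_id": "exceptions#p1", "parent_id": "exceptions", "score": 0.84},
    {"chunk_id": "shipping#p3", "parent_id": "shipping", "score": 0.52},
]
parents = {
    "returns": "Returns section text",
    "exceptions": "Exceptions section text",
    "shipping": "Shipping section text",
}
for item in expand_to_parents(hits, parents, max_parents=2):
    print(item["parent_id"], "via", item["matched_child"])
```

Keeping `matched_child` alongside the parent text lets a citation point at the narrow span even though the generator reads the wider section.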
Parent-child retrieval also improves citations. The child chunk tells you the exact evidence span. The parent chunk gives the model surrounding context. The citation can point to the child span while the answer uses the parent text for disambiguation.
7. Chunking Policy Should Follow Query Distribution
Different products ask different questions of the same corpus.
A developer-docs chatbot sees exact parameter names, error messages, version constraints, and code snippets. A customer-support bot sees policy questions, account-specific constraints, and procedural steps. A research assistant sees broad synthesis queries and multi-hop evidence gathering.
That changes the chunking strategy:
| Workload | Better starting strategy |
|---|---|
| Exact API lookup | Smaller chunks, strong metadata, hybrid search, code-aware splitting. |
| Support policy QA | Section-aware chunks, moderate overlap, parent expansion. |
| Research reports | Larger semantic chunks, reranking, summary metadata. |
| Compliance search | Clause-aware chunks, high boundary preservation, strict citations. |
| Codebase retrieval | AST/symbol-aware chunks, file path metadata, dependency expansion. |
| Meeting transcripts | Speaker/time metadata, topic segmentation, summary headers. |
The query distribution should be sampled from real traces when possible. When real traces are unavailable, create a synthetic set that includes the queries that usually break retrieval:
- exact IDs, SKUs, error codes, and parameter names
- "except when" policy questions
- comparisons across two sections
- table lookups
- questions that depend on headings
- stale or deleted facts
- multi-hop questions where evidence lives in separate chunks
The "Lost in the Middle" long-context result is relevant here. Longer context windows can still use retrieved passages unevenly. Liu et al. found that model performance can change when relevant information moves inside the input context, with weaker use of information in the middle for many evaluated models. Chunking and ranking therefore decide both what evidence enters the prompt and where it lands.
That makes retrieval ordering part of chunking policy. If a large parent expansion pushes the best evidence into the middle of a long context, a nominal recall win can become an answer-quality loss.
8. A Practical RAG Chunking Strategy Tuning Loop
Treat chunking as an experiment with a small grid and a fixed evaluation set.
Start with four candidates:
| Candidate | Search chunk | Overlap | Expansion |
|---|---|---|---|
| Small precise | 350 tokens | 50 tokens | None |
| Balanced | 800 tokens | 160 tokens | None |
| Boundary-safe | 800 tokens | 320 tokens | Deduplicate top-k |
| Parent-child | 250 tokens | 50 tokens | 1,000-token parent |
Then run the same queries through each candidate and record:

    retrieval:
      hit_rate@5
      recall@5
      precision@5
      mrr
      duplicate_ratio
    generation:
      answer accuracy
      citation precision
      unsupported claims
      prompt tokens
      latency

A useful decision rule:

    If recall is low, increase context preservation.
    If precision is low, reduce chunk size or add metadata filters.
    If duplicate ratio is high, reduce overlap or deduplicate by parent.
    If answers cite the right document but miss the answer, use parent expansion.
    If exact identifiers fail, add sparse or hybrid search.
    If latency rises faster than quality, lower k before shrinking all chunks.

Chunking should also be versioned. Store the chunking policy with each indexed document:

    {
      "chunking_policy": "docs-section-v3",
      "chunk_size_tokens": 800,
      "chunk_overlap_tokens": 160,
      "splitter": "heading_then_sentence",
      "source_revision": "git:8f2c1ab",
      "parent_id": "returns-policy#exceptions",
      "chunk_index": 12
    }

Without that metadata, retrieval regressions are hard to diagnose. A production incident may look like an embedding model problem when the real change was a parser update that removed headings, a chunk-size change that split tables, or a reindex that dropped parent ids.
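With that metadata stored, a cheap audit can flag two of those failure modes before they show up as answer regressions. The sketch below assumes each indexed chunk is available as a dict with the fields shown above. `audit_chunk_metadata` is a hypothetical helper, and `docs-section-v2` is a made-up older policy name.

```python
from collections import Counter


def audit_chunk_metadata(chunks, expected_policy):
    """Count chunks indexed under another policy version or missing a parent id."""
    policies = Counter(c.get("chunking_policy", "<missing>") for c in chunks)
    stale = sum(n for policy, n in policies.items() if policy != expected_policy)
    missing_parent = sum(1 for c in chunks if not c.get("parent_id"))
    return {
        "policies": dict(policies),
        "stale_chunks": stale,
        "missing_parent_id": missing_parent,
    }


indexed = [
    {"chunking_policy": "docs-section-v3", "parent_id": "returns-policy#exceptions"},
    {"chunking_policy": "docs-section-v2", "parent_id": "returns-policy#standard"},
    {"chunking_policy": "docs-section-v3", "parent_id": None},
]
print(audit_chunk_metadata(indexed, expected_policy="docs-section-v3"))
```

A nonzero `stale_chunks` count after a reindex suggests the index mixes chunk shapes, which makes any metric comparison between policies unreliable.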
9. Decision Rules That Hold Up in Practice
Use these defaults as a starting point, then validate against your corpus:
| Situation | Start here |
|---|---|
| General product docs | 600-900 token semantic chunks, 10-20% overlap. |
| FAQ pages | One entry per chunk, low overlap, strong title metadata. |
| Dense policy text | Section-aware chunks, 15-30% overlap, parent expansion. |
| Tables | Row groups with repeated headers and table title metadata. |
| Code | Symbol-aware chunks with file path, imports, and parent class. |
| Long research PDFs | Section-aware chunks plus summary metadata and reranking. |
Keep chunks self-contained enough to answer a narrow question, but small enough that the embedding still describes a specific idea.
The chunk should usually include:
- document title
- section path
- stable source id
- page, line, or byte offsets where available
- timestamps for transcripts
- permissions and tenant metadata
- parent-child relationship
- chunking policy version
The chunk should usually avoid:
- repeated navigation text
- boilerplate footers
- unrelated neighboring sections
- tables without headers
- code fragments without symbols
- PDF extraction artifacts
- duplicated overlap returned repeatedly to the generator
Those rules are more durable than a single magic chunk size.
10. What to Benchmark Before You Trust a RAG Chunking Strategy
Before promoting a new RAG chunking strategy, run a benchmark that forces the chunker to show its weaknesses.
Use at least these slices:
| Slice | Why it catches regressions |
|---|---|
| Single-fact lookup | Tests precision and citation span quality. |
| Exception questions | Tests boundary preservation and parent context. |
| Multi-hop questions | Tests whether related chunks appear together. |
| Table questions | Tests header retention and row grouping. |
| Exact identifiers | Tests sparse/hybrid fallback and metadata search. |
| Long documents | Tests ranking, prompt placement, and context dilution. |
| Deleted or stale content | Tests source revision and reindex behavior. |
Then inspect real retrieved chunks alongside aggregate scores. The most useful debugging view is often:

    query
    expected source ids
    top 10 retrieved chunks
      chunk text
      parent id
      score
    metadata filters applied
    final context sent to the model
    answer and citations

That trace connects chunking to retrieval and retrieval to generation. It also separates the common cases:
| Symptom | Likely chunking issue |
|---|---|
| Right document, wrong answer | Chunk lacks the specific evidence span. |
| Right sentence, wrong conclusion | Missing neighboring condition or exception. |
| Many similar chunks in top-k | Overlap too high or deduplication missing. |
| Exact product code missed | Dense-only retrieval needs sparse or metadata support. |
| Table answer wrong | Header or row relationship was split. |
| Citation points to broad section | Chunk is too large for precise grounding. |
Chunking is upstream of every RAG quality metric. The retriever can only rank the objects the ingestion pipeline created.
Conclusion
RAG chunking strategy is the document-design layer of vector search. Chunk size controls recall and precision. Overlap protects boundaries and adds duplicates. Semantic splitting keeps human structure intact. Parent-child retrieval separates search precision from generation context.
The practical path is simple: choose two or three chunking policies, run them against a fixed retrieval eval set, inspect the misses, and promote the policy that improves answerability without flooding the prompt with duplicated or diluted evidence.
References
- Patrick Lewis et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv, 2020. https://arxiv.org/abs/2005.11401
- Nelson F. Liu et al. "Lost in the Middle: How Language Models Use Long Contexts." arXiv, 2023. https://arxiv.org/abs/2307.03172
- OpenAI. "Retrieval." OpenAI API documentation. https://platform.openai.com/docs/guides/retrieval
- LlamaIndex. "Token text splitter." LlamaIndex documentation. https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/
- LangChain. "Splitting recursively." LangChain documentation. https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter
- LlamaIndex. "Retrieval Evaluation." LlamaIndex examples documentation. https://developers.llamaindex.ai/python/examples/evaluation/retrieval/retriever_eval/
- Ragas. "Context Recall." Ragas documentation. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/
- Pinecone. "Chunking Strategies for LLM Applications." Pinecone Learning Center. https://www.pinecone.io/learn/chunking-strategies/
