May 23, 2026 · RAG & Retrieval

RAG Chunking Strategy for Vector Search: Chunk Size, Overlap, and Retrieval Quality

How RAG chunking strategy shapes vector search quality through chunk size, overlap, semantic boundaries, parent-child retrieval, and evaluation metrics.

Abstract

A RAG system can have a strong embedding model, a fast vector index, and a reasonable prompt, and still miss the sentence that answers the user's question. The answer often vanishes before nearest-neighbor search starts, when the document is cut into chunks that are too small, too large, too noisy, or stripped of the metadata that made the text retrievable.

RAG chunking strategy sits before vector search. It decides what the embedding model sees, what the index stores, what the retriever can rank, and what the generator receives as evidence. A 200-token chunk can make retrieval precise and brittle. A 1,200-token chunk can preserve context and dilute the matching signal. Overlap can rescue boundary cases and also inflate the index with near-duplicates.

For the index mechanics, use the earlier HNSW and IVF-PQ vector search guide. For diagnosing downstream answer quality, use the hallucination and retrieval coverage guide. The missing layer is the document representation question: what should become a vector in the first place?

[Figure: RAG chunking strategy turns source documents into chunks, vectors, retrieved evidence, and answer context.]

1. Chunking Changes the Object Being Retrieved

Vector search ranks chunks. Documents only matter after the ingestion pipeline turns them into retrievable units. That sounds obvious until a retrieval trace returns the right file and the wrong evidence.

Take a support policy page with this structure:

Returns
Customers can return unopened hardware within 30 days.

Exceptions
Custom engraved hardware is final sale.
Enterprise contracts may override the standard return window.

A fixed splitter might produce:

Chunk A:
Returns
Customers can return unopened hardware within 30 days.

Chunk B:
Exceptions
Custom engraved hardware is final sale.
Enterprise contracts may override the standard return window.

For the query "Can an enterprise customer return engraved hardware after 30 days?", neither chunk is enough. Chunk A has the standard return window. Chunk B has exceptions. The answer depends on both. If the retriever only returns Chunk A, the model may answer with the standard rule. If it only returns Chunk B, it may answer without the baseline rule.

Now change the chunking:

Chunk A:
Returns
Customers can return unopened hardware within 30 days.

Exceptions
Custom engraved hardware is final sale.
Enterprise contracts may override the standard return window.

The larger chunk improves answerability for that question. It also adds more unrelated words to the embedding. A broad chunk can match many queries weakly, crowding out a smaller chunk that contains a more exact answer.

That is the central trade-off:

| Chunk shape | Retrieval benefit | Retrieval cost |
| --- | --- | --- |
| Small chunks | Sharp semantic match, lower prompt cost, easier citation. | Missing surrounding conditions, tables, definitions, or exceptions. |
| Large chunks | More local context and fewer split thoughts. | Lower embedding specificity, more prompt tokens, noisier ranking. |
| Overlapped chunks | Fewer boundary misses. | More storage, more duplicate hits, harder deduplication. |

The goal is to make the retrievable unit match the evidence unit. A paragraph may be enough for a definition. A whole section may be needed for a policy. A table row may need its header. A code function may need imports, type definitions, and the comment that explains the invariant.
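The returns-policy example can be turned into a rough answerability check. The sketch below assumes that required evidence can be approximated by literal phrases, which is a crude stand-in for real evidence labeling; `missing_evidence` is a hypothetical helper, not part of any library.

```python
def missing_evidence(chunk_text, required_phrases):
    """Return the required phrases that do not appear in the chunk."""
    lowered = chunk_text.lower()
    return [p for p in required_phrases if p.lower() not in lowered]

# Evidence needed for: "Can an enterprise customer return engraved hardware after 30 days?"
required = ["within 30 days", "final sale", "Enterprise contracts may override"]

chunk_a = "Returns\nCustomers can return unopened hardware within 30 days."
chunk_b = (
    "Exceptions\nCustom engraved hardware is final sale.\n"
    "Enterprise contracts may override the standard return window."
)
section_chunk = chunk_a + "\n\n" + chunk_b  # the larger, section-level chunk

for name, chunk in [("A", chunk_a), ("B", chunk_b), ("section", section_chunk)]:
    print(name, missing_evidence(chunk, required))
```

Only the section-level chunk reports no missing phrases, which is the sense in which the retrievable unit matches the evidence unit for this query.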

2. Chunk Size Is a Recall-Precision Dial

Chunk size controls the amount of text embedded as one vector. In most systems, it is the first dial people change because frameworks expose it directly.

LlamaIndex's token text splitter exposes chunk_size and chunk_overlap, with token-aware splitting and backup separators. LangChain's text splitter docs show RecursiveCharacterTextSplitter with the same core controls, including chunk_size and chunk_overlap, while preserving larger text units where possible. OpenAI's retrieval docs show static chunking controls for vector store files, including max_chunk_size_tokens and chunk_overlap_tokens; the current default is 800-token chunks with 400-token overlap.

Those defaults are starting points. A handbook, a codebase, financial filings, call transcripts, research PDFs, and product documentation all deserve separate measurement.

A useful first-order model:

$$N_{\text{chunks}} \approx \left\lceil \frac{T - O}{S - O} \right\rceil$$

where:

  • $T$ is document length in tokens.
  • $S$ is chunk size.
  • $O$ is overlap.
  • $S - O$ is stride.

For a 20,000-token manual:

import math

def chunk_count(tokens: int, chunk_size: int, overlap: int) -> int:
    stride = chunk_size - overlap
    return math.ceil((tokens - overlap) / stride)

for size, overlap in [(400, 80), (800, 200), (800, 400), (1200, 200)]:
    print(size, overlap, chunk_count(20_000, size, overlap))

Expected output:

400 80 63
800 200 33
800 400 49
1200 200 20

The 800-token chunk with 400-token overlap creates almost 50% more chunks (49 versus 33) than the same size with 200-token overlap. That affects storage, embedding spend, indexing time, and the number of near-duplicate candidates the retriever must rank.
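The count formula can be checked against an actual sliding-window splitter. The sketch below assumes the document is already a list of tokens and is at least one chunk long; `split_tokens` is a hypothetical helper, not a LlamaIndex, LangChain, or OpenAI function.

```python
def split_tokens(tokens, chunk_size, overlap):
    """Fixed-size sliding window: each chunk starts (chunk_size - overlap) tokens after the last."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not tokens:
        return []
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += stride
    return chunks

tokens = list(range(20_000))  # stand-in for a tokenized 20,000-token manual
for size, overlap in [(400, 80), (800, 200), (800, 400), (1200, 200)]:
    print(size, overlap, len(split_tokens(tokens, size, overlap)))
```

For these four configurations the splitter produces the same counts as the formula, and consecutive chunks share exactly `overlap` tokens.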

The retrieval quality curve is usually non-monotonic:

| Moving chunk size up | What improves | What can regress |
| --- | --- | --- |
| 200 -> 400 tokens | More complete sentences and short paragraphs. | More mixed topics in dense pages. |
| 400 -> 800 tokens | Better local context for policies, tutorials, and reports. | Less precise matching for exact facts. |
| 800 -> 1,200+ tokens | Better section-level coherence. | Higher prompt cost and more diluted embeddings. |

Small chunks favor precision. Large chunks favor context. The right size depends on query distribution.

For "What is the refund window?", a short chunk with the exact sentence is enough. For "Can enterprise customers return engraved hardware?", a larger chunk or parent-section expansion is safer. For "Which API parameter controls vector store chunk overlap?", a precise chunk with code and parameter names will beat a broad page-level chunk.

[Figure: Chunk size and overlap change retrieval precision, context coverage, duplication, and prompt cost.]

3. Overlap Fixes Boundaries and Creates Duplicate Evidence

Overlap exists because real documents ignore artificial chunk boundaries. A definition starts at the end of one chunk. A caveat begins in the next. A table header sits above the row. A function call is separated from the type that explains it.

Overlap carries some trailing context forward:

Chunk 1, no overlap:
... Enterprise contracts may override the standard return window.

Chunk 2, no overlap:
Custom engraved hardware is final sale.

Chunk 1, with overlap:
... Customers can return unopened hardware within 30 days.
Enterprise contracts may override the standard return window.

Chunk 2, with overlap:
Enterprise contracts may override the standard return window.
Custom engraved hardware is final sale.

The overlapped version makes each boundary chunk more answerable. It also means the same sentence can appear in several vectors. When top-k retrieval returns three variants of the same local passage, the model receives less diverse evidence than the retrieval count suggests.

A practical overlap policy:

| Corpus shape | Starting overlap | Why |
| --- | --- | --- |
| Short FAQ entries | 0-10% | Entries are already atomic. |
| Product docs and tutorials | 10-20% | Steps and caveats often cross paragraph boundaries. |
| Legal, policy, and compliance text | 15-30% | Conditions and exceptions need continuity. |
| Code | Function-aware first, then 5-15% | Syntax boundaries matter more than raw overlap. |
| Tables | Preserve headers with rows | Token overlap alone can split semantics. |

Measure overlap as a boundary recall tool. If increasing overlap raises hit rate while the top results become mostly duplicates, the retriever may need deduplication, parent-child retrieval, or max marginal relevance.

A simple duplicate check:

def duplicate_ratio(results):
    # Share of results whose parent section already appeared in the list.
    if not results:
        return 0.0
    parent_ids = [r["parent_id"] for r in results]
    return 1 - (len(set(parent_ids)) / len(parent_ids))

results = [
    {"chunk_id": "policy-12-a", "parent_id": "returns"},
    {"chunk_id": "policy-12-b", "parent_id": "returns"},
    {"chunk_id": "policy-13-a", "parent_id": "exceptions"},
    {"chunk_id": "shipping-02-a", "parent_id": "shipping"},
]

print(duplicate_ratio(results))

Expected output:

0.25

That number is a smoke test rather than a universal metric. If the duplicate ratio climbs as overlap increases, the recall gain may be coming from repeated evidence rather than broader coverage.
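When the duplicate ratio does climb, max marginal relevance (MMR) is one way to trade a little relevance for diversity at selection time. The sketch below is a plain-Python illustration under stated assumptions: vectors are small lists, similarity is cosine, and the weight `lambda_` and the toy vectors are hypothetical values chosen for the example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr(query_vec, candidates, k, lambda_=0.7):
    """Greedy MMR over (chunk_id, vector) pairs; returns the selected chunk ids in order."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            relevance = cosine(query_vec, item[1])
            # Similarity to the closest chunk already selected (0 when nothing is selected yet).
            redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [chunk_id for chunk_id, _ in selected]

query = [1.0, 0.0]
candidates = [
    ("a", [1.0, 0.10]),  # best match
    ("b", [1.0, 0.12]),  # near-duplicate of "a", as overlapped chunks tend to be
    ("c", [0.6, 0.80]),  # less relevant but different content
]

print(mmr(query, candidates, k=2, lambda_=1.0))  # relevance only → ['a', 'b']
print(mmr(query, candidates, k=2, lambda_=0.3))  # diversity weighted → ['a', 'c']
```

With `lambda_=1.0` the selection is ordinary top-k; lowering it penalizes chunks that resemble ones already chosen.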

4. Semantic Boundaries Beat Blind Token Windows

A fixed 800-token window is easy to operate. It also cuts through document structure.

The better default is a hierarchy:

document
  section
    subsection
      paragraph
        sentence
          token fallback

Split at the highest boundary that keeps the chunk under the size budget. Then use token splitting only when the natural unit is too large.
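A minimal sketch of that rule, assuming paragraphs are separated by blank lines, sentences end in `.`, `!`, or `?`, and a whitespace word count is an acceptable stand-in for a real tokenizer. `split_by_hierarchy` and `LEVELS` are hypothetical names; framework splitters differ in detail.

```python
import re

# Split levels from coarsest to finest: (boundary regex, joiner used when merging pieces).
LEVELS = [
    (r"\n\s*\n", "\n\n"),     # paragraphs
    (r"(?<=[.!?])\s+", " "),  # sentences
]

def n_tokens(text):
    # Whitespace word count as a rough token estimate.
    return len(text.split())

def split_by_hierarchy(text, max_tokens, levels=LEVELS):
    text = text.strip()
    if not text:
        return []
    if n_tokens(text) <= max_tokens:
        return [text]  # the natural unit already fits the budget
    if not levels:
        # Token fallback: no natural boundary left, cut fixed windows.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    (pattern, joiner), finer = levels[0], levels[1:]
    pieces = []
    for part in re.split(pattern, text):
        pieces.extend(split_by_hierarchy(part, max_tokens, finer))
    # Greedily merge neighbours back together while they stay under the budget.
    merged = []
    for piece in pieces:
        if merged and n_tokens(merged[-1]) + n_tokens(piece) <= max_tokens:
            merged[-1] = merged[-1] + joiner + piece
        else:
            merged.append(piece)
    return merged

policy = (
    "Returns\nCustomers can return unopened hardware within 30 days.\n\n"
    "Exceptions\nCustom engraved hardware is final sale. "
    "Enterprise contracts may override the standard return window."
)
for budget in (30, 16, 10):
    print(budget, len(split_by_hierarchy(policy, budget)))
```

With a budget of 30 words the whole policy stays in one chunk, at 16 it splits on the paragraph boundary, and at 10 the second paragraph is split again by sentence.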

For documentation:

H1: Vector stores
H2: File upload
H3: Chunking strategy
Paragraphs and code examples

The chunk should carry headings as metadata or prefix text:

title: Vector stores
section: File upload
subsection: Chunking strategy

You can customize chunking with max_chunk_size_tokens and chunk_overlap_tokens...

That heading context matters because many local paragraphs are written with assumed context. A paragraph that starts with "You can customize this..." may embed poorly when detached from the section that defines "this."
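One way to carry that context is to track the heading path while walking the document. The sketch below assumes Markdown-style `#` headings and joins the path with " > " as a prefix; both choices, and the function name `chunks_with_heading_path`, are illustrative assumptions. It does not handle `#` characters inside code blocks.

```python
def chunks_with_heading_path(markdown_text):
    """Group body text under its heading path and prefix each chunk with that path."""
    path = []    # current heading stack, e.g. ["Vector stores", "File upload"]
    buffer = []  # body lines collected since the last heading
    chunks = []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({
                "section_path": list(path),
                "text": " > ".join(path) + "\n\n" + body,
            })
        buffer.clear()

    for line in markdown_text.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(line.lstrip("#").strip())
        else:
            buffer.append(line)
    flush()
    return chunks

doc = """# Vector stores
## File upload
### Chunking strategy
You can customize this with two parameters.
## Search
Queries run against the store.
"""

for chunk in chunks_with_heading_path(doc):
    print(chunk["section_path"])
```

The paragraph that begins "You can customize this" is now embedded together with the path that says what "this" refers to.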

For code, syntax-aware chunking usually beats paragraph splitting:

| Artifact | Better chunk unit | Metadata to keep |
| --- | --- | --- |
| Python module | Function or class, with imports when needed. | File path, symbol name, class, dependency imports. |
| TypeScript component | Component plus props/types. | Route, export name, related hooks. |
| API docs | Endpoint section. | Method, path, auth scope, version. |
| SQL | Query or migration block. | Table names, migration id, schema. |

For tables, the unit is often a row plus headers, or a small group of rows plus headers. A chunk containing "$0.10/GB/day" without the storage tier, product, and pricing context is too weak to support a grounded answer.
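A minimal sketch of row-plus-header chunking. The table title, column names, and the second row are hypothetical values invented for the example; only the "$0.10/GB/day" figure comes from the text above.

```python
def row_chunks(table_title, headers, rows, rows_per_chunk=1):
    """Emit one chunk per group of rows, repeating the title and headers in every chunk."""
    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        lines = [f"table: {table_title}"]
        for row in rows[i:i + rows_per_chunk]:
            # Pair every cell with its header so the value never travels alone.
            lines.append("; ".join(f"{h}: {v}" for h, v in zip(headers, row)))
        chunks.append("\n".join(lines))
    return chunks

headers = ["product", "tier", "price"]
rows = [
    ["Vector storage", "standard", "$0.10/GB/day"],
    ["Vector storage", "archive", "$0.02/GB/day"],  # hypothetical second row
]

for chunk in row_chunks("Storage pricing", headers, rows):
    print(chunk)
    print("---")
```

Each chunk now states which product, tier, and table the price belongs to, at the cost of repeating the header text in every vector.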

The same rule applies to PDFs. Page boundaries are implementation details. If the question is answered by a caption, a footnote, or a row that continues across pages, the chunker has to preserve that relationship explicitly.

5. Retrieval Quality Needs Chunk-Level Metrics

A chunking change should be evaluated before it reaches production. The basic question is:

For the queries users actually ask, does the retriever return chunks
that contain enough evidence to answer correctly?

Start with a small evaluation set:

| Field | Example |
| --- | --- |
| Query | "Can enterprise customers return engraved hardware?" |
| Expected source ids | returns-policy#exceptions, returns-policy#enterprise-contracts |
| Answerable from one chunk? | No |
| Required evidence | Standard return window, engraved exception, enterprise override. |
| Negative distractors | General returns FAQ, warranty policy, shipping exceptions. |

Then compare chunking configurations:

400 tokens / 80 overlap
800 tokens / 200 overlap
800 tokens / 400 overlap
section-aware / 15% overlap
parent-child retrieval

Use retrieval metrics before answer metrics:

| Metric | What it tells you |
| --- | --- |
| Hit rate@k | Did any expected chunk appear in the top k? |
| Recall@k | How much required evidence was retrieved? |
| MRR | How early did the first useful chunk appear? |
| Precision@k | How much of the returned context was useful? |
| Context recall | Were the answer-supporting claims covered by retrieved context? |
| Duplicate ratio | Did top-k collapse into repeated overlapping chunks? |

LlamaIndex's retrieval evaluator exposes standard ranking metrics such as hit rate, MRR, precision, recall, average precision, and NDCG. Ragas separates context recall into claim support over retrieved context, and also supports non-LLM or ID-based context recall when reference contexts are available.
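For a single query, the ID-based versions of these metrics are short enough to compute by hand. The sketch below is a from-scratch illustration, not the LlamaIndex or Ragas API; the retrieved list is a hypothetical ranking, and the expected ids are the ones from the evaluation example above.

```python
def retrieval_metrics(retrieved_ids, expected_ids, k):
    """ID-based hit rate, recall, precision, and reciprocal rank for one query at cutoff k."""
    top_k = retrieved_ids[:k]
    expected = set(expected_ids)
    hits = [cid for cid in top_k if cid in expected]
    # Rank (1-based) of the first expected chunk, or None if none was retrieved.
    first_rank = next((rank for rank, cid in enumerate(top_k, 1) if cid in expected), None)
    return {
        "hit_rate": 1.0 if hits else 0.0,
        "recall": len(set(hits)) / len(expected) if expected else 0.0,
        "precision": len(hits) / len(top_k) if top_k else 0.0,
        "mrr": 1 / first_rank if first_rank else 0.0,
    }

expected = ["returns-policy#exceptions", "returns-policy#enterprise-contracts"]
retrieved = [  # hypothetical ranking returned by a retriever
    "returns-faq",
    "returns-policy#exceptions",
    "warranty-policy",
    "shipping-exceptions",
    "returns-policy#enterprise-contracts",
]

print(retrieval_metrics(retrieved, expected, k=3))
print(retrieval_metrics(retrieved, expected, k=5))
```

At k=3 the query counts as a hit but recall is only 0.5, because the enterprise-contract chunk sits at rank 5. Averaging the per-query values over the evaluation set gives the aggregate hit rate, recall, precision, and MRR.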

The key comparison is the metric profile:

small chunks:
  precision@5 improves
  recall@5 drops on multi-hop or exception-heavy questions

large chunks:
  recall@5 improves
  precision@5 drops
  answer latency rises from larger context

high overlap:
  hit_rate@5 improves
  duplicate_ratio rises
  diverse evidence may drop

The answer model should be evaluated after the retriever passes this first gate. If the retriever misses the required source, answer grading mostly measures the model's willingness to guess.

6. Parent-Child Retrieval Separates Search From Context

One common fix is to search over small chunks and answer with larger parent chunks.

child chunk:
  250-token paragraph used for embedding and nearest-neighbor search

parent chunk:
  surrounding section passed to the generator after a child hit

This gives the retriever a sharp vector while giving the generator enough context. It is especially useful for:

  • API documentation where a parameter paragraph needs the endpoint section.
  • Legal text where a clause needs neighboring exceptions.
  • Tutorials where one step depends on the previous step.
  • Tables where a row needs headers and notes.
  • Code where a function needs class-level state or imports.

The trade-off is context expansion. If every hit expands to a large parent, top-k can become expensive quickly:

$$\text{prompt tokens} \approx k \cdot P$$

where $P$ is parent size. With $k = 8$ and 1,200-token parents, retrieval can inject 9,600 tokens before the user question, instructions, citations, or conversation history.

A safer policy:

retrieve 20 child chunks
deduplicate by parent id
rerank child chunks
expand top 4 unique parents
trim parent text around matched child when possible

That policy keeps high recall during search, preserves enough context for generation, and avoids filling the prompt with repeated sibling chunks from the same section.
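The deduplicate-and-expand steps of that policy can be sketched as follows. The hit records, scores, and parent texts are hypothetical; sorting by score stands in for a real reranker, and trimming the parent around the matched child is left out.

```python
def expand_to_parents(child_hits, parents, max_parents=4):
    """Keep the best-scoring child per parent, then return up to max_parents parent texts."""
    ranked = sorted(child_hits, key=lambda hit: hit["score"], reverse=True)
    seen = set()
    context = []
    for hit in ranked:
        parent_id = hit["parent_id"]
        if parent_id in seen:
            continue  # a sibling chunk from this section is already represented
        seen.add(parent_id)
        context.append({
            "parent_id": parent_id,
            "matched_child": hit["chunk_id"],  # keeps the exact span for citation
            "text": parents[parent_id],
        })
        if len(context) == max_parents:
            break
    return context

parents = {  # hypothetical parent sections
    "returns": "Returns\nCustomers can return unopened hardware within 30 days.",
    "exceptions": "Exceptions\nCustom engraved hardware is final sale.",
    "shipping": "Shipping\nOrders ship within two business days.",
}
child_hits = [  # hypothetical search results
    {"chunk_id": "returns-1", "parent_id": "returns", "score": 0.91},
    {"chunk_id": "returns-2", "parent_id": "returns", "score": 0.89},
    {"chunk_id": "exceptions-1", "parent_id": "exceptions", "score": 0.84},
    {"chunk_id": "shipping-1", "parent_id": "shipping", "score": 0.40},
]

for item in expand_to_parents(child_hits, parents, max_parents=2):
    print(item["parent_id"], "via", item["matched_child"])
```

Four child hits collapse to two unique parents, so the prompt budget is spent on two sections instead of two copies of the same one.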

Parent-child retrieval also improves citations. The child chunk tells you the exact evidence span. The parent chunk gives the model surrounding context. The citation can point to the child span while the answer uses the parent text for disambiguation.

7. Chunking Policy Should Follow Query Distribution

Different products ask different questions of the same corpus.

A developer-docs chatbot sees exact parameter names, error messages, version constraints, and code snippets. A customer-support bot sees policy questions, account-specific constraints, and procedural steps. A research assistant sees broad synthesis queries and multi-hop evidence gathering.

That changes the chunking strategy:

| Workload | Better starting strategy |
| --- | --- |
| Exact API lookup | Smaller chunks, strong metadata, hybrid search, code-aware splitting. |
| Support policy QA | Section-aware chunks, moderate overlap, parent expansion. |
| Research reports | Larger semantic chunks, reranking, summary metadata. |
| Compliance search | Clause-aware chunks, high boundary preservation, strict citations. |
| Codebase retrieval | AST/symbol-aware chunks, file path metadata, dependency expansion. |
| Meeting transcripts | Speaker/time metadata, topic segmentation, summary headers. |

The query distribution should be sampled from real traces when possible. When real traces are unavailable, create a synthetic set that includes the queries that usually break retrieval:

  • exact IDs, SKUs, error codes, and parameter names
  • "except when" policy questions
  • comparisons across two sections
  • table lookups
  • questions that depend on headings
  • stale or deleted facts
  • multi-hop questions where evidence lives in separate chunks

The "Lost in the Middle" long-context result is relevant here. Longer context windows can still use retrieved passages unevenly. Liu et al. found that model performance can change when relevant information moves inside the input context, with weaker use of information in the middle for many evaluated models. Chunking and ranking therefore decide both what evidence enters the prompt and where it lands.

That makes retrieval ordering part of chunking policy. If a large parent expansion pushes the best evidence into the middle of a long context, a nominal recall win can become an answer-quality loss.
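One possible mitigation, offered here as an assumption rather than something the cited paper prescribes, is to reorder the ranked passages so the strongest evidence sits at the start and end of the context and the weakest lands in the middle. `reorder_for_edges` is a hypothetical helper.

```python
def reorder_for_edges(ranked):
    """Given items ranked best-first, alternate them between the front and the back."""
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]

ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_edges(ranked))
# → ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

Whether this helps depends on the model and prompt layout, so it belongs in the same evaluation loop as chunk size and overlap.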

8. A Practical RAG Chunking Strategy Tuning Loop

Treat chunking as an experiment with a small grid and a fixed evaluation set.

Start with four candidates:

| Candidate | Search chunk | Overlap | Expansion |
| --- | --- | --- | --- |
| Small precise | 350 tokens | 50 tokens | None |
| Balanced | 800 tokens | 160 tokens | None |
| Boundary-safe | 800 tokens | 320 tokens | Deduplicate top-k |
| Parent-child | 250 tokens | 50 tokens | 1,000-token parent |

Then run the same queries through each candidate and record:

retrieval:
  hit_rate@5
  recall@5
  precision@5
  mrr
  duplicate_ratio

generation:
  answer accuracy
  citation precision
  unsupported claims
  prompt tokens
  latency

A useful decision rule:

If recall is low, increase context preservation.
If precision is low, reduce chunk size or add metadata filters.
If duplicate ratio is high, reduce overlap or deduplicate by parent.
If answers cite the right document but miss the answer, use parent expansion.
If exact identifiers fail, add sparse or hybrid search.
If latency rises faster than quality, lower k before shrinking all chunks.
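The metric-driven rules above can be encoded as a small triage function so every candidate is judged the same way. The thresholds below are hypothetical placeholders to be tuned per corpus, and only the three rules that map directly to a retrieval metric are included.

```python
def suggest_changes(metrics, min_recall=0.8, min_precision=0.4, max_duplicate_ratio=0.3):
    """Map a candidate's retrieval metrics to the tuning actions from the decision rule."""
    suggestions = []
    if metrics["recall"] < min_recall:
        suggestions.append("increase context preservation")
    if metrics["precision"] < min_precision:
        suggestions.append("reduce chunk size or add metadata filters")
    if metrics["duplicate_ratio"] > max_duplicate_ratio:
        suggestions.append("reduce overlap or deduplicate by parent")
    return suggestions

# Hypothetical results for a high-overlap candidate.
boundary_safe = {"recall": 0.86, "precision": 0.35, "duplicate_ratio": 0.45}
print(suggest_changes(boundary_safe))
```

The remaining rules (parent expansion, hybrid search, lowering k) depend on answer inspection and latency measurements, so they are better handled as manual review steps.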

Chunking should also be versioned. Store the chunking policy with each indexed document:

{
  "chunking_policy": "docs-section-v3",
  "chunk_size_tokens": 800,
  "chunk_overlap_tokens": 160,
  "splitter": "heading_then_sentence",
  "source_revision": "git:8f2c1ab",
  "parent_id": "returns-policy#exceptions",
  "chunk_index": 12
}

Without that metadata, retrieval regressions are hard to diagnose. A production incident may look like an embedding model problem when the real change was a parser update that removed headings, a chunk-size change that split tables, or a reindex that dropped parent ids.
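A short fingerprint of the policy fields makes silent configuration drift easier to spot across reindex runs. This is a sketch under the assumption that the policy is a flat JSON-serializable dict; `policy_fingerprint` is a hypothetical helper, and the 12-character truncation is an arbitrary choice.

```python
import hashlib
import json

def policy_fingerprint(policy):
    """Stable short hash of a chunking policy, independent of key order."""
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

policy_v3 = {
    "chunking_policy": "docs-section-v3",
    "chunk_size_tokens": 800,
    "chunk_overlap_tokens": 160,
    "splitter": "heading_then_sentence",
}
# Same policy with one field changed, as might happen in an unreviewed config edit.
policy_changed = {**policy_v3, "chunk_overlap_tokens": 400}

print(policy_fingerprint(policy_v3))
print(policy_fingerprint(policy_v3) == policy_fingerprint(policy_changed))
```

Storing the fingerprint next to each chunk lets a retrieval trace show whether two chunks were produced under the same settings, even if the policy name was not bumped.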

9. Decision Rules That Hold Up in Practice

Use these defaults as a starting point, then validate against your corpus:

| Situation | Start here |
| --- | --- |
| General product docs | 600-900 token semantic chunks, 10-20% overlap. |
| FAQ pages | One entry per chunk, low overlap, strong title metadata. |
| Dense policy text | Section-aware chunks, 15-30% overlap, parent expansion. |
| Tables | Row groups with repeated headers and table title metadata. |
| Code | Symbol-aware chunks with file path, imports, and parent class. |
| Long research PDFs | Section-aware chunks plus summary metadata and reranking. |

Keep chunks self-contained enough to answer a narrow question, but small enough that the embedding still describes a specific idea.

The chunk should usually include:

  • document title
  • section path
  • stable source id
  • page, line, or byte offsets where available
  • timestamps for transcripts
  • permissions and tenant metadata
  • parent-child relationship
  • chunking policy version

The chunk should usually avoid:

  • repeated navigation text
  • boilerplate footers
  • unrelated neighboring sections
  • tables without headers
  • code fragments without symbols
  • PDF extraction artifacts
  • duplicated overlap returned repeatedly to the generator

Those rules are more durable than a single magic chunk size.

10. What to Benchmark Before You Trust a RAG Chunking Strategy

Before promoting a new RAG chunking strategy, run a benchmark that forces the chunker to show its weaknesses.

Use at least these slices:

| Slice | Why it catches regressions |
| --- | --- |
| Single-fact lookup | Tests precision and citation span quality. |
| Exception questions | Tests boundary preservation and parent context. |
| Multi-hop questions | Tests whether related chunks appear together. |
| Table questions | Tests header retention and row grouping. |
| Exact identifiers | Tests sparse/hybrid fallback and metadata search. |
| Long documents | Tests ranking, prompt placement, and context dilution. |
| Deleted or stale content | Tests source revision and reindex behavior. |
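Reporting recall per slice keeps a strong single-fact score from hiding a weak exception or multi-hop score. The sketch below assumes each evaluation record carries a slice label, expected ids, and retrieved ids; the record layout and the sample records are hypothetical.

```python
from collections import defaultdict

def recall_by_slice(records):
    """Average ID-based recall per benchmark slice."""
    per_slice = defaultdict(list)
    for record in records:
        expected = set(record["expected_ids"])  # assumed non-empty for every record
        found = expected & set(record["retrieved_ids"])
        per_slice[record["slice"]].append(len(found) / len(expected))
    return {name: sum(values) / len(values) for name, values in per_slice.items()}

records = [  # hypothetical evaluation results
    {"slice": "single_fact", "expected_ids": ["a"], "retrieved_ids": ["a", "x"]},
    {"slice": "exception", "expected_ids": ["b", "c"], "retrieved_ids": ["b", "x"]},
    {"slice": "exception", "expected_ids": ["d"], "retrieved_ids": ["y"]},
]

print(recall_by_slice(records))
# → {'single_fact': 1.0, 'exception': 0.25}
```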

Then inspect real retrieved chunks alongside aggregate scores. The most useful debugging view is often:

query
expected source ids
top 10 retrieved chunks
chunk text
parent id
score
metadata filters applied
final context sent to the model
answer and citations

That trace connects chunking to retrieval and retrieval to generation. It also separates the common cases:

| Symptom | Likely chunking issue |
| --- | --- |
| Right document, wrong answer | Chunk lacks the specific evidence span. |
| Right sentence, wrong conclusion | Missing neighboring condition or exception. |
| Many similar chunks in top-k | Overlap too high or deduplication missing. |
| Exact product code missed | Dense-only retrieval needs sparse or metadata support. |
| Table answer wrong | Header or row relationship was split. |
| Citation points to broad section | Chunk is too large for precise grounding. |

Chunking is upstream of every RAG quality metric. The retriever can only rank the objects the ingestion pipeline created.

Conclusion

RAG chunking strategy is the document-design layer of vector search. Chunk size controls recall and precision. Overlap protects boundaries and adds duplicates. Semantic splitting keeps human structure intact. Parent-child retrieval separates search precision from generation context.

The practical path is simple: choose two or three chunking policies, run them against a fixed retrieval eval set, inspect the misses, and promote the policy that improves answerability without flooding the prompt with duplicated or diluted evidence.

References

  1. Patrick Lewis et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv, 2020. https://arxiv.org/abs/2005.11401
  2. Nelson F. Liu et al. "Lost in the Middle: How Language Models Use Long Contexts." arXiv, 2023. https://arxiv.org/abs/2307.03172
  3. OpenAI. "Retrieval." OpenAI API documentation. https://platform.openai.com/docs/guides/retrieval
  4. LlamaIndex. "Token text splitter." LlamaIndex documentation. https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/
  5. LangChain. "Splitting recursively." LangChain documentation. https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter
  6. LlamaIndex. "Retrieval Evaluation." LlamaIndex examples documentation. https://developers.llamaindex.ai/python/examples/evaluation/retrieval/retriever_eval/
  7. Ragas. "Context Recall." Ragas documentation. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/
  8. Pinecone. "Chunking Strategies for LLM Applications." Pinecone Learning Center. https://www.pinecone.io/learn/chunking-strategies/