RAG Finds the Right Document but Gives the Wrong Answer

Abstract

A RAG system can have a strong embedding model, a fast vector index, and a reasonable prompt, and still miss the sentence that answers the user's question. The answer often vanishes before nearest-neighbor search starts, when the document is cut into chunks that are too small, too large, too noisy, or stripped of the metadata that made the text retrievable.

Chunking strategy sits upstream of vector search (the HNSW and IVF-PQ index mechanics) and upstream of answer-quality diagnosis (attribution traces and retrieval coverage metrics). It decides what the embedding model sees, what the index stores, what the retriever can rank, and what the generator receives as evidence. A 200-token chunk can make retrieval precise and brittle. A 1,200-token chunk can preserve context and dilute the matching signal. Overlap can rescue boundary cases and also inflate the index with near-duplicates.

[Figure: Structure-aware RAG chunking pipeline, from source documents to retrieved answer context]

1. Chunking Changes the Object Being Retrieved

Vector search ranks chunks. Documents only matter after the ingestion pipeline turns them into retrievable units. That sounds obvious until a retrieval trace returns the right file and the wrong evidence.

Take a support policy page with this structure:

Section	Text
Returns	Customers can return unopened hardware within 30 days.
Exceptions	Custom engraved hardware is final sale. Enterprise contracts may override the standard return window.

For the query "Can an enterprise customer return engraved hardware after 30 days?", neither isolated section is enough. The standard return window, the engraved-hardware exception, and the enterprise-contract caveat all matter. If the retriever only returns the Returns section, the model may answer with the standard rule. If it only returns the Exceptions section, it may answer without the baseline rule.

Now change the chunking:

Chunk	Evidence retained
Small split	One chunk contains the standard return window. Another chunk contains the exceptions. Each chunk is easier to match precisely and easier to under-answer from.
Larger section chunk	One chunk contains the return window, engraved exception, and enterprise override. The answer is better supported, but the embedding now mixes more ideas into one vector.

The larger chunk improves answerability for that question. It also adds more unrelated words to the embedding. A broad chunk can match many queries weakly, crowding out a smaller chunk that contains a more exact answer.

That is the central trade-off:

Chunk shape	Retrieval benefit	Retrieval cost
Small chunks	Sharp semantic match, lower prompt cost, easier citation.	Missing surrounding conditions, tables, definitions, or exceptions.
Large chunks	More local context and fewer split thoughts.	Lower embedding specificity, more prompt tokens, noisier ranking.
Overlapped chunks	Fewer boundary misses.	More storage, more duplicate hits, harder deduplication.

The goal is to make the retrievable unit match the evidence unit. A paragraph may be enough for a definition. A whole section may be needed for a policy. A table row may need its header. A code function may need imports, type definitions, and the comment that explains the invariant.

2. Chunk Size Is a Recall-Precision Dial

Chunk size controls the amount of text embedded as one vector. In most systems, it is the first dial people change because frameworks expose it directly.

Most chunking systems implement the same abstraction with different boundary rules. LlamaIndex treats chunking as token-budgeted node parsing: it keeps text under a target token size, carries overlap forward, and falls back through separators when natural boundaries do not fit. LangChain's recursive splitter starts from the largest textual unit and descends through smaller separators, preserving paragraphs and sentences when they fit before cutting more aggressively. OpenAI vector stores make chunking an ingestion policy on each file: the static strategy sets the maximum token budget and overlap before chunks are embedded and stored. At the time of writing, the documented default is 800-token chunks with 400-token overlap.
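As a rough illustration of the static strategy, the sketch below builds the payload as a plain dictionary using the `max_chunk_size_tokens` and `chunk_overlap_tokens` parameter names. The helper name `static_chunking_strategy` is hypothetical, and the bounds it checks (100 to 4,096 tokens, overlap no more than half the chunk size) are assumptions drawn from the API reference at the time of writing, so verify them against the current documentation before relying on them.

```python
def static_chunking_strategy(max_chunk_size_tokens: int = 800,
                             chunk_overlap_tokens: int = 400) -> dict:
    """Build a static chunking payload for a vector store file (sketch).

    The limits below are assumptions to re-check against the API reference.
    """
    if not 100 <= max_chunk_size_tokens <= 4096:
        raise ValueError("max_chunk_size_tokens must be between 100 and 4096")
    if chunk_overlap_tokens < 0 or chunk_overlap_tokens * 2 > max_chunk_size_tokens:
        raise ValueError("chunk_overlap_tokens must not exceed half the chunk size")
    return {
        "type": "static",
        "static": {
            "max_chunk_size_tokens": max_chunk_size_tokens,
            "chunk_overlap_tokens": chunk_overlap_tokens,
        },
    }


# A lower-overlap variant of the default policy.
strategy = static_chunking_strategy(800, 200)
print(strategy)
```

The dictionary would be passed as the chunking strategy when attaching a file to a vector store. The exact client call is left out here because it varies by SDK version.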

Those defaults are starting points. A handbook, a codebase, financial filings, call transcripts, research PDFs, and product documentation all deserve separate measurement.

A useful first-order model:

$$N_{\text{chunks}} \approx \left\lceil \frac{T - O}{S - O} \right\rceil$$

where:

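$T$ is document length in tokens, with $T > S$.
$S$ is chunk size in tokens.
$O$ is overlap in tokens, with $0 \le O < S$.
$S - O$ is the stride, the number of new tokens each additional chunk covers.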

For a 20,000-token manual, the storage and ranking surface changes quickly:

Chunk size	Overlap	Stride	Approximate chunks
400	80	320	63
800	200	600	33
800	400	400	49
1,200	200	1,000	20

The 800-token chunk with 400-token overlap creates almost 50% more chunks (49 versus 33) than the same size with 200-token overlap. That affects storage, embedding spend, indexing time, and the number of near-duplicate candidates the retriever must rank.
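The table can be reproduced with a few lines of Python. This is a minimal sketch of the first-order model only. The function name `approx_chunks` is an illustrative choice, and real splitters that respect sentence or section boundaries will produce somewhat different counts.

```python
def approx_chunks(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """First-order chunk count: ceil((T - O) / (S - O))."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1  # a short, non-empty document fits in one chunk
    stride = chunk_size - overlap
    # Integer ceiling division avoids floating-point rounding.
    return -(-(total_tokens - overlap) // stride)


for size, overlap in [(400, 80), (800, 200), (800, 400), (1200, 200)]:
    print(size, overlap, size - overlap, approx_chunks(20_000, size, overlap))
# 400 80 320 63
# 800 200 600 33
# 800 400 400 49
# 1200 200 1000 20
```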

The retrieval quality curve is usually non-monotonic:

Moving chunk size up	What improves	What can regress
200 -> 400 tokens	More complete sentences and short paragraphs.	More mixed topics in dense pages.
400 -> 800 tokens	Better local context for policies, tutorials, and reports.	Less precise matching for exact facts.
800 -> 1,200+ tokens	Better section-level coherence.	Higher prompt cost and more diluted embeddings.

Small chunks favor precision. Large chunks favor context. The right size depends on query distribution.

For "What is the refund window?", a short chunk with the exact sentence is enough. For "Can enterprise customers return engraved hardware?", a larger chunk or parent-section expansion is safer. For "Which API parameter controls vector store chunk overlap?", a precise chunk with code and parameter names will beat a broad page-level chunk.

[Figure: Chunk size and overlap change retrieval precision, context coverage, duplication, and prompt cost]

3. Overlap Fixes Boundaries and Creates Duplicate Evidence

Overlap exists because real documents ignore artificial chunk boundaries. A definition starts at the end of one chunk. A caveat begins in the next. A table header sits above the row. A function call is separated from the type that explains it.

Overlap carries some trailing context forward:

Boundary case	What the retriever sees
No overlap	One chunk ends with "Custom engraved hardware is final sale." The next chunk starts with "Enterprise contracts may override the standard return window." Each hit can miss the other condition.
With overlap	The engraved-hardware sentence is carried into the next chunk, where it sits beside the enterprise-contract override, so a hit on that chunk returns both conditions.

The overlapped version makes each boundary chunk more useful. It also means the same sentence can appear in several vectors. When top-k retrieval returns three variants of the same local passage, the model receives less diverse evidence than the retrieval count suggests.
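The mechanism can be sketched as a sliding window. To keep the example readable, the units here are whole sentences rather than tokens, which is an assumption made for illustration. A production splitter would count tokens with the embedding model's tokenizer. The function name `sliding_windows` is hypothetical.

```python
def sliding_windows(units: list[str], size: int, overlap: int) -> list[list[str]]:
    """Cut a sequence into windows of `size` units that share `overlap` units."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap
    windows = []
    start = 0
    while start < len(units):
        windows.append(units[start:start + size])
        if start + size >= len(units):
            break  # the last window already reaches the end
        start += stride
    return windows


sentences = [
    "Customers can return unopened hardware within 30 days.",
    "Custom engraved hardware is final sale.",
    "Enterprise contracts may override the standard return window.",
]

no_overlap = sliding_windows(sentences, size=2, overlap=0)
with_overlap = sliding_windows(sentences, size=2, overlap=1)

# Without overlap the enterprise override sits alone in the second chunk.
print(no_overlap[1])
# With overlap the engraved-hardware sentence is repeated next to it.
print(with_overlap[1])
```

The repeated sentence is what makes the second chunk answerable. It is also the source of the near-duplicate hits discussed next.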

A practical overlap policy:

Corpus shape	Starting overlap	Why
Short FAQ entries	0-10%	Entries are already atomic.
Product docs and tutorials	10-20%	Steps and caveats often cross paragraph boundaries.
Legal, policy, and compliance text	15-30%	Conditions and exceptions need continuity.
Code	Function-aware first, then 5-15%	Syntax boundaries matter more than raw overlap.
Tables	Preserve headers with rows	Token overlap alone can split semantics.

Treat overlap as a boundary-recall tool and measure it that way. If increasing overlap raises hit rate while the top results become mostly duplicates, the retriever may need deduplication, parent-child retrieval, or maximal marginal relevance (MMR).

A simple duplicate-ratio smoke test compares unique parent ids with the returned result count:

Top-k result	Parent id
`policy-12-a`	`returns`
`policy-12-b`	`returns`
`policy-13-a`	`exceptions`
`shipping-02-a`	`shipping`

Here, four results cover three unique parents, so one result repeats an already represented parent and the duplicate ratio is 1 - 3/4 = 0.25. The ratio is a local smoke test rather than a standard metric, but its trend is informative: if it climbs as overlap increases, the recall gain may be coming from repeated evidence rather than broader coverage.
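A minimal sketch of that smoke test, assuming each retrieved chunk carries a parent id. The function name `duplicate_ratio` is illustrative.

```python
def duplicate_ratio(parent_ids: list[str]) -> float:
    """Share of top-k results whose parent is already represented."""
    if not parent_ids:
        return 0.0
    return 1 - len(set(parent_ids)) / len(parent_ids)


top_k_parents = ["returns", "returns", "exceptions", "shipping"]
print(duplicate_ratio(top_k_parents))  # 0.25
```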

4. Semantic Boundaries Beat Blind Token Windows

A fixed 800-token window is easy to operate. It also cuts through document structure.

The better default is a hierarchy: document, section, subsection, paragraph, sentence, then token fallback. Split at the highest boundary that keeps the chunk under the size budget. Use token splitting only when the natural unit is too large.
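A simplified sketch of that fallback order is shown below. It covers only the last three levels (paragraph, sentence, then a hard cut), counts whitespace-separated words as a stand-in for tokens, and uses a naive regex for sentence boundaries. All of those are assumptions made to keep the example self-contained, and the names `split_hierarchical` and `LEVELS` are hypothetical rather than any library's API.

```python
import re

# (split pattern, joiner), from coarsest to finest natural boundary.
LEVELS = [
    (r"\n\s*\n", "\n\n"),      # paragraphs
    (r"(?<=[.!?])\s+", " "),   # sentences (naive boundary detection)
]


def n_tokens(text: str) -> int:
    # Whitespace word count, standing in for a real tokenizer.
    return len(text.split())


def split_hierarchical(text: str, max_tokens: int, levels=LEVELS) -> list[str]:
    """Split at the coarsest boundary that keeps each chunk within budget."""
    text = text.strip()
    if not text:
        return []
    if n_tokens(text) <= max_tokens:
        return [text]
    if not levels:
        # Token fallback: no natural boundary left, so cut on word count.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    (pattern, joiner), finer = levels[0], levels[1:]
    chunks: list[str] = []
    current = ""
    for piece in re.split(pattern, text):
        piece = piece.strip()
        if not piece:
            continue
        candidate = f"{current}{joiner}{piece}" if current else piece
        if n_tokens(candidate) <= max_tokens:
            current = candidate  # keep packing units at this level
            continue
        if current:
            chunks.append(current)
            current = ""
        if n_tokens(piece) <= max_tokens:
            current = piece
        else:
            # The unit itself is too large: descend to the next finer level.
            chunks.extend(split_hierarchical(piece, max_tokens, finer))
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Customers can return unopened hardware within 30 days.\n\n"
    "Custom engraved hardware is final sale. "
    "Enterprise contracts may override the standard return window."
)
for chunk in split_hierarchical(doc, max_tokens=10):
    print(repr(chunk))
```

With a budget of 10 words, the first paragraph survives intact and only the second paragraph is split at its sentence boundary. With a budget of 16, both paragraphs stay whole.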

For documentation, a page like "Vector stores > File upload > Chunking strategy" should keep that heading path attached to the local paragraph. The chunk can carry headings as metadata, prefix text, or both:

Field	Example
Title	Vector stores
Section	File upload
Subsection	Chunking strategy
Local text	You can customize chunking with `max_chunk_size_tokens` and `chunk_overlap_tokens`...

That heading context matters because many local paragraphs are written with assumed context. A paragraph that starts with "You can customize this..." may embed poorly when detached from the section that defines "this."
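One way to carry the heading path is to store it as metadata and also prefix it to the text that gets embedded. The sketch below assumes a hypothetical `Chunk` type and a " > " separator. Whether the prefix helps retrieval depends on the embedding model and should be measured on your own evaluation set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    heading_path: tuple[str, ...]  # kept as metadata for filtering and citation
    text: str                      # the local paragraph as written

    def embedding_text(self) -> str:
        """Text sent to the embedding model: heading path, then local text."""
        if not self.heading_path:
            return self.text
        return " > ".join(self.heading_path) + "\n\n" + self.text


chunk = Chunk(
    heading_path=("Vector stores", "File upload", "Chunking strategy"),
    text="You can customize chunking with `max_chunk_size_tokens` "
         "and `chunk_overlap_tokens`...",
)
print(chunk.embedding_text())
```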

For code, syntax-aware chunking usually beats paragraph splitting:

Artifact	Better chunk unit	Metadata to keep
Python module	Function or class, with imports when needed.	File path, symbol name, class, dependency imports.
TypeScript component	Component plus props/types.	Route, export name, related hooks.
API docs	Endpoint section.	Method, path, auth scope, version.
SQL	Query or migration block.	Table names, migration id, schema.

For tables, the unit is often a row plus headers, or a small group of rows plus headers. A chunk containing "$0.10/GB/day" without the storage tier, product, and pricing context is too weak to support a grounded answer.
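A minimal sketch of row-group chunking with repeated headers follows. The function name, the "Table:" prefix, and the pipe-separated layout are illustrative choices, and the sample data reuses the chunk-count table from section 2.

```python
def table_row_chunks(title: str, headers: list[str], rows: list[list[str]],
                     rows_per_chunk: int = 1) -> list[str]:
    """Emit one chunk per row group, repeating the title and header row."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    header_line = " | ".join(headers)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        lines = [f"Table: {title}", header_line]
        lines += [" | ".join(row) for row in group]
        chunks.append("\n".join(lines))
    return chunks


table_chunks = table_row_chunks(
    title="Approximate chunks for a 20,000-token manual",
    headers=["Chunk size", "Overlap", "Stride", "Approximate chunks"],
    rows=[
        ["400", "80", "320", "63"],
        ["800", "200", "600", "33"],
        ["800", "400", "400", "49"],
        ["1,200", "200", "1,000", "20"],
    ],
    rows_per_chunk=2,
)
print(table_chunks[1])
```

Every chunk now states what its numbers mean, at the cost of repeating the header text in each vector.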

The same rule applies to PDFs. Page boundaries are implementation details. If the question is answered by a caption, a footnote, or a row that continues across pages, the chunker has to preserve that relationship explicitly.

5. Retrieval Quality Needs Chunk-Level Metrics

A chunking change should be evaluated before it reaches production. The basic question is: for the queries users actually ask, does the retriever return chunks that contain enough evidence to answer correctly?

Start with a small evaluation set:

Field	Example
Query	"Can enterprise customers return engraved hardware?"
Expected source ids	`returns-policy#exceptions`, `returns-policy#enterprise-contracts`
Answerable from one chunk?	No
Required evidence	Standard return window, engraved exception, enterprise override.
Negative distractors	General returns FAQ, warranty policy, shipping exceptions.

Then compare chunking configurations:

Candidate	Configuration
Small precise	400-token chunks, 80-token overlap
Balanced	800-token chunks, 200-token overlap
Boundary-safe	800-token chunks, 400-token overlap
Structure-aware	Section-aware chunks with 15% overlap
Parent-child	Small child chunks retrieved, larger parent sections expanded

Use retrieval metrics before answer metrics:

Metric	What it tells you
Hit rate@k	Did any expected chunk appear in the top k?
Recall@k	How much required evidence was retrieved?
MRR	How early did the first useful chunk appear?
Precision@k	How much of the returned context was useful?
Context recall	Were the answer-supporting claims covered by retrieved context?
Duplicate ratio	Did top-k collapse into repeated overlapping chunks?

LlamaIndex's retrieval evaluator exposes standard ranking metrics such as hit rate, MRR, precision, recall, average precision, and NDCG. Ragas separates context recall into claim support over retrieved context, and also supports non-LLM or ID-based context recall when reference contexts are available.
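For a small evaluation set, the ID-based versions of these metrics are simple enough to compute directly. The sketch below is a plain-Python illustration, not the LlamaIndex or Ragas implementation. The retrieved ids other than the two expected ones are made up for the example, and precision is computed over the results actually returned within the top k.

```python
def hit_rate_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """1.0 if any expected chunk appears in the top k, else 0.0."""
    return 1.0 if any(cid in expected for cid in retrieved[:k]) else 0.0


def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """Fraction of expected chunks found in the top k."""
    if not expected:
        return 0.0
    return len(expected & set(retrieved[:k])) / len(expected)


def precision_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """Fraction of the returned top-k chunks that were expected."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for cid in top if cid in expected) / len(top)


def reciprocal_rank(retrieved: list[str], expected: set[str]) -> float:
    """1 / rank of the first expected chunk; MRR is the mean over queries."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in expected:
            return 1.0 / rank
    return 0.0


expected = {"returns-policy#exceptions", "returns-policy#enterprise-contracts"}
retrieved = [
    "returns-faq#general",                  # distractor (hypothetical id)
    "returns-policy#exceptions",
    "warranty-policy#coverage",             # distractor (hypothetical id)
    "returns-policy#enterprise-contracts",
    "shipping-policy#exceptions",           # distractor (hypothetical id)
]

print(hit_rate_at_k(retrieved, expected, 5))   # 1.0
print(recall_at_k(retrieved, expected, 3))     # 0.5
print(recall_at_k(retrieved, expected, 5))     # 1.0
print(precision_at_k(retrieved, expected, 5))  # 0.4
print(reciprocal_rank(retrieved, expected))    # 0.5
```

The gap between recall@3 and recall@5 is the kind of signal that distinguishes "the evidence was retrieved" from "the evidence was retrieved early enough to be used."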

The key comparison is the metric profile:

Configuration	Typical gain	Typical regression to watch
Small chunks	Precision@5 improves.	Recall@5 drops on multi-hop or exception-heavy questions.
Large chunks	Recall@5 improves.	Precision@5 drops, and answer latency rises from larger context.
High overlap	Hit rate@5 improves.	Duplicate ratio rises, and diverse evidence may drop.

The answer model should be evaluated after the retriever passes this first gate. If the retriever misses the required source, answer grading mostly measures the model's willingness to guess.

6. Parent-Child Retrieval Separates Search From Context

One common fix is to search over small chunks and answer with larger parent chunks.

Object	Role
Child chunk	A small paragraph or passage used for embedding and nearest-neighbor search.
Parent chunk	The surrounding section passed to the generator after a child hit.

This gives the retriever a sharp vector while giving the generator enough context. It is especially useful for:

API documentation where a parameter paragraph needs the endpoint section.
Legal text where a clause needs neighboring exceptions.
Tutorials where one step depends on the previous step.
Tables where a row needs headers and notes.
Code where a function needs class-level state or imports.

The trade-off is context expansion. If every hit expands to a large parent, top-k can become expensive quickly:

$$\text{prompt tokens} \approx k \cdot P$$

where $P$ is parent size. With $k=8$ and 1,200-token parents, retrieval can inject 9,600 tokens before the user question, instructions, citations, or conversation history.

A safer policy:

Retrieve 20 child chunks.
Deduplicate by parent id.
Rerank child chunks.
Expand the top four unique parents.
Trim parent text around the matched child when possible.

That policy keeps high recall during search, preserves enough context for generation, and avoids filling the prompt with repeated sibling chunks from the same section.
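The selection part of that policy can be sketched as follows. The `ChildHit` type, the scores, and the extra chunk ids are hypothetical. The score is assumed to already reflect reranking, and trimming parent text around the matched child is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildHit:
    child_id: str
    parent_id: str
    score: float  # higher is better; assumed to come from a reranker


def select_parents(hits: list[ChildHit], max_parents: int = 4) -> list[ChildHit]:
    """Keep the best-scoring child per parent, then the top unique parents."""
    best: dict[str, ChildHit] = {}
    for hit in hits:
        kept = best.get(hit.parent_id)
        if kept is None or hit.score > kept.score:
            best[hit.parent_id] = hit
    ranked = sorted(best.values(), key=lambda h: h.score, reverse=True)
    return ranked[:max_parents]


def prompt_token_estimate(n_parents: int, parent_tokens: int) -> int:
    """Upper-bound estimate of retrieved context: k * P."""
    return n_parents * parent_tokens


hits = [
    ChildHit("policy-12-a", "returns", 0.82),
    ChildHit("policy-12-b", "returns", 0.79),
    ChildHit("policy-13-a", "exceptions", 0.77),
    ChildHit("shipping-02-a", "shipping", 0.41),
    ChildHit("policy-13-b", "exceptions", 0.80),
    ChildHit("warranty-01-a", "warranty", 0.38),
    ChildHit("faq-03-a", "faq", 0.30),
]

selected = select_parents(hits, max_parents=4)
print([h.parent_id for h in selected])
# ['returns', 'exceptions', 'shipping', 'warranty']
print(prompt_token_estimate(len(selected), 1200))  # 4800
```

Each selected hit keeps its child id, so the citation can still point at the exact matched span after the parent text is expanded.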

Parent-child retrieval also improves citations. The child chunk tells you the exact evidence span. The parent chunk gives the model surrounding context. The citation can point to the child span while the answer uses the parent text for disambiguation.

7. Chunking Policy Should Follow Query Distribution

Different products ask different questions of the same corpus.

A developer-docs chatbot sees exact parameter names, error messages, version constraints, and code snippets. A customer-support bot sees policy questions, account-specific constraints, and procedural steps. A research assistant sees broad synthesis queries and multi-hop evidence gathering.

That changes the chunking strategy:

Workload	Better starting strategy
Exact API lookup	Smaller chunks, strong metadata, hybrid search, code-aware splitting.
Support policy QA	Section-aware chunks, moderate overlap, parent expansion.
Research reports	Larger semantic chunks, reranking, summary metadata.
Compliance search	Clause-aware chunks, high boundary preservation, strict citations.
Codebase retrieval	AST/symbol-aware chunks, file path metadata, dependency expansion.
Meeting transcripts	Speaker/time metadata, topic segmentation, summary headers.

The query distribution should be sampled from real traces when possible. When real traces are unavailable, create a synthetic set that includes the queries that usually break retrieval:

exact IDs, SKUs, error codes, and parameter names
"except when" policy questions
comparisons across two sections
table lookups
questions that depend on headings
stale or deleted facts
multi-hop questions where evidence lives in separate chunks

The "Lost in the Middle" long-context result is relevant here. Models with longer context windows can still use retrieved passages unevenly. Liu et al. found that model performance can change when relevant information moves inside the input context, with weaker use of information in the middle for many evaluated models. Chunking and ranking therefore decide both what evidence enters the prompt and where it lands.

That makes retrieval ordering part of chunking policy. If a large parent expansion pushes the best evidence into the middle of a long context, a nominal recall win can become an answer-quality loss.
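One possible response, offered here as an illustration rather than a recommendation from the cited paper, is to place the highest-ranked passages at the start and end of the context and let the weakest ones fall in the middle. The function name `edge_first_order` is hypothetical, and whether this ordering helps depends on the model, so it belongs in the same evaluation loop as chunk size and overlap.

```python
def edge_first_order(ranked: list) -> list:
    """Reorder a best-first list so top items sit at both ends of the context."""
    front, back = [], []
    for i, item in enumerate(ranked):
        # Alternate: ranks 1, 3, 5... go to the front, ranks 2, 4... to the back.
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


print(edge_first_order(["rank1", "rank2", "rank3", "rank4", "rank5"]))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```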

8. A Practical RAG Chunking Strategy Tuning Loop

Treat chunking as an experiment with a small grid and a fixed evaluation set.

Start with four candidates:

Candidate	Search chunk	Overlap	Expansion
Small precise	350 tokens	50 tokens	None
Balanced	800 tokens	160 tokens	None
Boundary-safe	800 tokens	320 tokens	Deduplicate top-k
Parent-child	250 tokens	50 tokens	1,000-token parent

Then run the same queries through each candidate and record:

Area	Metrics
Retrieval	Hit rate@5, recall@5, precision@5, MRR, duplicate ratio.
Generation	Answer accuracy, citation precision, unsupported claims, prompt tokens, latency.

A useful decision rule:

Symptom	Adjustment
Recall is low.	Preserve more context: larger chunks, more overlap, or structure-aware boundaries.
Precision is low.	Reduce chunk size or add metadata filters.
Duplicate ratio is high.	Reduce overlap or deduplicate by parent.
Answers cite the right document but miss the answer.	Use parent expansion.
Exact identifiers fail.	Add sparse or hybrid search.
Latency rises faster than quality.	Lower k before shrinking all chunks.

Chunking should also be versioned. Store the chunking policy with each indexed document:

Metadata field	Example
`chunking_policy`	`docs-section-v3`
`chunk_size_tokens`	`800`
`chunk_overlap_tokens`	`160`
`splitter`	`heading_then_sentence`
`source_revision`	`git:8f2c1ab`
`parent_id`	`returns-policy#exceptions`
`chunk_index`	`12`

Without that metadata, retrieval regressions are hard to diagnose. A production incident may look like an embedding model problem when the real change was a parser update that removed headings, a chunk-size change that split tables, or a reindex that dropped parent ids.
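A minimal sketch of such a record, using the field names from the table above. The `ChunkMetadata` type and the `policy_drift` helper are hypothetical, and the second record's values are invented to show how a diff surfaces a silent configuration change.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkMetadata:
    chunking_policy: str
    chunk_size_tokens: int
    chunk_overlap_tokens: int
    splitter: str
    source_revision: str
    parent_id: str
    chunk_index: int


def policy_drift(old: ChunkMetadata, new: ChunkMetadata) -> dict:
    """Return the fields that differ between two index versions."""
    before, after = asdict(old), asdict(new)
    return {name: (before[name], after[name])
            for name in before if before[name] != after[name]}


indexed = ChunkMetadata(
    chunking_policy="docs-section-v3",
    chunk_size_tokens=800,
    chunk_overlap_tokens=160,
    splitter="heading_then_sentence",
    source_revision="git:8f2c1ab",
    parent_id="returns-policy#exceptions",
    chunk_index=12,
)

# Hypothetical reindex that changed the policy name and overlap.
reindexed = ChunkMetadata(
    chunking_policy="docs-section-v4",
    chunk_size_tokens=800,
    chunk_overlap_tokens=320,
    splitter="heading_then_sentence",
    source_revision="git:8f2c1ab",
    parent_id="returns-policy#exceptions",
    chunk_index=12,
)

print(policy_drift(indexed, reindexed))
# {'chunking_policy': ('docs-section-v3', 'docs-section-v4'), 'chunk_overlap_tokens': (160, 320)}
```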

9. Decision Rules That Hold Up in Practice

Use these defaults as a starting point, then validate against your corpus:

Situation	Start here
General product docs	600-900 token semantic chunks, 10-20% overlap.
FAQ pages	One entry per chunk, low overlap, strong title metadata.
Dense policy text	Section-aware chunks, 15-30% overlap, parent expansion.
Tables	Row groups with repeated headers and table title metadata.
Code	Symbol-aware chunks with file path, imports, and parent class.
Long research PDFs	Section-aware chunks plus summary metadata and reranking.

Keep chunks self-contained enough to answer a narrow question, but small enough that the embedding still describes a specific idea.

The chunk should usually include:

document title
section path
stable source id
page, line, or byte offsets where available
timestamps for transcripts
permissions and tenant metadata
parent-child relationship
chunking policy version

The chunk should usually avoid:

repeated navigation text
boilerplate footers
unrelated neighboring sections
tables without headers
code fragments without symbols
PDF extraction artifacts
duplicated overlap returned repeatedly to the generator

Those rules are more durable than a single magic chunk size.

10. What to Benchmark Before You Trust a RAG Chunking Strategy

Before promoting a new RAG chunking strategy, run a benchmark that forces the chunker to show its weaknesses.

Use at least these slices:

Slice	Why it catches regressions
Single-fact lookup	Tests precision and citation span quality.
Exception questions	Tests boundary preservation and parent context.
Multi-hop questions	Tests whether related chunks appear together.
Table questions	Tests header retention and row grouping.
Exact identifiers	Tests sparse/hybrid fallback and metadata search.
Long documents	Tests ranking, prompt placement, and context dilution.
Deleted or stale content	Tests source revision and reindex behavior.

Then inspect real retrieved chunks alongside aggregate scores. The most useful debugging view connects each query to the expected source ids, top retrieved chunks, chunk text, parent id, score, metadata filters, final context, answer, and citations.

That trace connects chunking to retrieval and retrieval to generation. It also separates the common cases:

Symptom	Likely chunking issue
Right document, wrong answer	Chunk lacks the specific evidence span.
Right sentence, wrong conclusion	Missing neighboring condition or exception.
Many similar chunks in top-k	Overlap too high or deduplication missing.
Exact product code missed	Dense-only retrieval needs sparse or metadata support.
Table answer wrong	Header or row relationship was split.
Citation points to broad section	Chunk is too large for precise grounding.

Chunking is upstream of every RAG quality metric. The retriever can only rank the objects the ingestion pipeline created.

Conclusion

RAG chunking strategy is the document-design layer of vector search. Chunk size controls recall and precision. Overlap protects boundaries and adds duplicates. Semantic splitting keeps human structure intact. Parent-child retrieval separates search precision from generation context.

The practical path is simple: choose two or three chunking policies, run them against a fixed retrieval eval set, inspect the misses, and promote the policy that improves answerability without flooding the prompt with duplicated or diluted evidence.

References

Patrick Lewis et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv, 2020. https://arxiv.org/abs/2005.11401
Nelson F. Liu et al. "Lost in the Middle: How Language Models Use Long Contexts." arXiv, 2023. https://arxiv.org/abs/2307.03172
OpenAI. "Retrieval." OpenAI API documentation. https://platform.openai.com/docs/guides/retrieval
LlamaIndex. "Token text splitter." LlamaIndex documentation. https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/
LangChain. "Splitting recursively." LangChain documentation. https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter
LlamaIndex. "Retrieval Evaluation." LlamaIndex examples documentation. https://developers.llamaindex.ai/python/examples/evaluation/retrieval/retriever_eval/
Ragas. "Context Recall." Ragas documentation. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/
Pinecone. "Chunking Strategies for LLM Applications." Pinecone Learning Center. https://www.pinecone.io/learn/chunking-strategies/
