Chunking and Indexing for RAG: Size, Overlap, and Recall
How to set chunk size, overlap, and dual indexing (embedding + tsvector) to maximize recall in production RAG pipelines — with Python code and real tradeoffs.
Before you tune embeddings or rerankers, look at how you slice knowledge: it is the silent bottleneck in a RAG pipeline. Chunks that are too large dilute relevance; chunks that are too small break context. Semantic-only indexing misses exact terms. This article covers structure-aware chunking and dual indexing at ingest.
Why chunking dominates recall
The retriever only finds what was indexed. If the right answer is split across two chunks with no overlap, or buried in a 4,000-token block with TOC noise, the LLM never gets useful context. Chip Huyen stresses in AI Engineering (p. 118) that retrieval quality — and chunking is part of it — caps end-to-end RAG performance.
Chunk size: practical rule
There is no universal magic number. For dense English technical books with ~500 tokens per paragraph (roughly 2,000 characters, taking about 4 characters per token as a rule of thumb):
| Strategy | Chunk size | When to use |
|---|---|---|
| Precision | 800–1200 chars | Queries with exact terms, APIs, acronyms |
| Balanced | 1500–2000 chars | Tutorials, mixed chapters |
| Broad recall | 2500–3000 chars | Concepts spanning sections |
In production, 2000 chars with 200 overlap is a solid starting point for technical PDFs — aligned with what Unlocking Data with Generative AI and RAG (p. 307) discusses about preserving semantic units.
Overlap: avoid cutting ideas in half
An overlap of 10–15% of the chunk size reduces context loss at boundaries. Without overlap, a definition that ends in chunk N and continues in chunk N+1 is split in two, and neither chunk may rank well for the full query.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""],
)
chunks = splitter.split_text(markdown_section)
```
Separator order prioritizes Markdown heading breaks before splitting paragraphs — structure-aware chunking.
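To see what overlap buys at a boundary, here is a minimal sketch of a plain fixed-window splitter. It is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally; `sliding_chunks` and `expected_chunk_count` are hypothetical helper names used only for illustration.

```python
import math


def sliding_chunks(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Cut text into fixed windows of `size` chars that share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reached the end
            break
    return chunks


def expected_chunk_count(n_chars: int, size: int = 2000, overlap: int = 200) -> int:
    """Number of windows produced by sliding_chunks for a text of n_chars."""
    if n_chars <= 0:
        return 0
    if n_chars <= size:
        return 1
    return math.ceil((n_chars - overlap) / (size - overlap))


# A term that straddles the 2000-char boundary.
text = "a" * 1995 + "DEFINITION" + "b" * 2995
with_overlap = sliding_chunks(text, size=2000, overlap=200)
without_overlap = sliding_chunks(text, size=2000, overlap=0)
```

With 200 characters of overlap the second window starts at character 1800, so the whole term lands in one chunk. With no overlap it is cut between the first and second chunks and no single chunk contains it.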
Section chunking vs. token-blind
Splitting the entire PDF into fixed windows ignores chapters. Better pipeline:
- Extract structured Markdown (headings `#`, `##`, `###`).
- Group by chapter/section.
- Sub-split long sections with a token-aware splitter.
- Attach metadata: `book_title`, `chapter`, `page`.
Metadata feeds chapter reranking and (p. N) citations in the generator — a pattern described in LLM Design Patterns (p. 414) for RAG with provenance.
```python
def build_chunks(pages: list[dict]) -> list[dict]:
    """Split by section; attach the section's heading and page to each chunk."""
    chunks = []
    # detect_sections is a helper that groups page text by heading.
    for section in detect_sections(pages):
        for piece in splitter.split_text(section["text"]):
            # Skip fragments too short to carry useful context.
            if len(piece.strip()) < 120:
                continue
            chunks.append({
                "content": piece,
                "chapter": section["heading"],
                "page": section["page"],
            })
    return chunks
```
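`build_chunks` relies on `detect_sections`, which is not shown here. The following is one possible sketch, not the original helper. It assumes each page is a dict with an integer `page` and a Markdown `text` field, and that a section's page is the page where its heading appears; both the input shape and the `_flush` helper are assumptions made for illustration.

```python
import re

# Matches Markdown headings of level 1 to 3, e.g. "## Overlap".
HEADING_RE = re.compile(r"^(#{1,3})\s+(.*\S)\s*$")


def _flush(sections: list[dict], current: dict) -> None:
    """Store the section being built, unless it has no body text."""
    body = "\n".join(current["lines"]).strip()
    if body:
        sections.append({
            "heading": current["heading"],
            "page": current["page"],
            "text": body,
        })


def detect_sections(pages: list[dict]) -> list[dict]:
    """Group page text into sections delimited by Markdown headings."""
    sections: list[dict] = []
    # Text that appears before the first heading gets an empty heading.
    current = {"heading": "", "page": None, "lines": []}
    for page in pages:
        for line in page["text"].splitlines():
            match = HEADING_RE.match(line)
            if match:
                _flush(sections, current)
                current = {"heading": match.group(2), "page": page["page"], "lines": []}
            else:
                if current["page"] is None:
                    current["page"] = page["page"]
                current["lines"].append(line)
    _flush(sections, current)
    return sections


pages = [
    {"page": 1, "text": "# Chunking\nSize matters.\nMore text."},
    {"page": 2, "text": "still chunking\n## Overlap\nOverlap text."},
]
sections = detect_sections(pages)
```

A section that runs across a page break stays in one piece and keeps the page of its heading. Headings with no body text of their own (a chapter title immediately followed by a subsection heading) are dropped in this sketch.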
Dual indexing at ingest
Each chunk needs two complementary indexes:
- Embedding (E5 multilingual, `passage:` prefix) → HNSW for semantic search.
- tsvector (`english` config for EN books) → GIN for keyword search.
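```python
from sqlalchemy import text


async def upsert_chunk(session, chunk, embedding):
    # Let PostgreSQL build the tsvector with the same 'english' config used for search.
    tsv = await session.execute(
        text("SELECT to_tsvector('english', :content)"),
        {"content": chunk["content"]},
    )
    # BookChunk is the ORM model for the chunk table (defined elsewhere).
    row = BookChunk(
        content=chunk["content"],
        chapter=chunk["chapter"],
        page=chunk["page"],
        embedding=embedding,
        tsv=tsv.scalar(),
    )
    session.add(row)  # not committed here
```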
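Without a tsvector stored and indexed at ingest, hybrid search at query time is impractical: computing `to_tsvector` per row on the fly cannot use the GIN index, so in practice you fall back to semantic paraphrase alone and lose acronyms.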
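The two indexes and the E5 prefixes can be sketched as follows. This assumes PostgreSQL with the pgvector extension, a table named `book_chunks` whose columns match the `BookChunk` fields above, and cosine distance for the embedding index; the table name, index names, and distance choice are all assumptions, not details from the pipeline described here.

```python
# E5 models expect a role prefix on every input text.
E5_PASSAGE_PREFIX = "passage: "  # for chunks at ingest
E5_QUERY_PREFIX = "query: "      # for user questions at search time


def as_passage(chunk_text: str) -> str:
    """Prefix a chunk before embedding it for the index."""
    return E5_PASSAGE_PREFIX + chunk_text


def as_query(question: str) -> str:
    """Prefix a user question before embedding it for search."""
    return E5_QUERY_PREFIX + question


# DDL for the two complementary indexes (hypothetical table and index names).
INDEX_DDL = [
    # Approximate nearest-neighbour search over embeddings (pgvector HNSW).
    "CREATE INDEX IF NOT EXISTS book_chunks_embedding_hnsw "
    "ON book_chunks USING hnsw (embedding vector_cosine_ops)",
    # Keyword search over the precomputed tsvector column.
    "CREATE INDEX IF NOT EXISTS book_chunks_tsv_gin "
    "ON book_chunks USING gin (tsv)",
]
```

Using the `passage:` prefix at ingest and the `query:` prefix at search time keeps both sides consistent with how E5 models were trained; mixing them up tends to hurt retrieval quality.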
Noise filters before indexing
TOC chunks, index pages, and copyright lines pollute the index. Heuristic gates at ingest (a regex for TOC lines with dot leaders such as `....`, another for index lines like `term, 42`) and at runtime (`filter_book_noise`) keep garbage from dominating top-k.
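An ingest-time gate along these lines could look like the sketch below. The patterns and the 50% threshold are illustrative assumptions that would need tuning against real books, and `is_noise_chunk` is a hypothetical name, not the `filter_book_noise` function mentioned above.

```python
import re

# "1. Introduction ........ 3": dot leaders followed by a page number.
TOC_LINE_RE = re.compile(r"\.{4,}\s*\d+\s*$")
# "tsvector, 57-58": a short term followed by page numbers, no sentence punctuation.
INDEX_LINE_RE = re.compile(r"^[^.!?\n]{1,60},\s*\d+(?:\s*[-–,]\s*\d+)*\s*$")
# Copyright boilerplate.
COPYRIGHT_RE = re.compile(r"copyright|©|all rights reserved", re.IGNORECASE)


def is_noise_line(line: str) -> bool:
    """True if a single line looks like TOC, index, or copyright residue."""
    return bool(
        TOC_LINE_RE.search(line)
        or INDEX_LINE_RE.match(line)
        or COPYRIGHT_RE.search(line)
    )


def is_noise_chunk(chunk: str, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of the non-empty lines look like noise."""
    lines = [line.strip() for line in chunk.splitlines() if line.strip()]
    if not lines:
        return True  # nothing worth indexing
    noisy = sum(is_noise_line(line) for line in lines)
    return noisy / len(lines) >= threshold


toc = "1. Introduction ........ 3\n2. Chunking ........ 17"
index_page = "embedding, 42\ntsvector, 57-58"
prose = "Chunk overlap reduces context loss at boundaries.\nIt costs some extra storage."
```

Scoring by the share of noisy lines, rather than rejecting on a single match, avoids discarding a real paragraph that happens to quote one page reference.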
Measuring impact
Build a golden set of technical queries paired with expected keywords. Metrics:
- Precision@K: what share of the top-K chunks contain an expected keyword?
- Context recall: do terms from the retrieved chunks appear in the generated draft?
Change chunk size by ±500 chars, re-ingest, and compare MRR (the mean reciprocal rank of the first chunk that contains an expected keyword). Chip Huyen (p. 285) treats indexing and retrieval eval as an iterative cycle, not a one-shot.
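A keyword-based evaluation loop might be sketched like this. It assumes the golden set is a list of dicts with `query` and `keywords`, that `retrieve` is whatever function returns ranked chunk texts for a query, and that relevance is approximated by a case-insensitive keyword match; all names here are hypothetical.

```python
from statistics import mean
from typing import Callable


def contains_keyword(chunk: str, keywords: list[str]) -> bool:
    """Case-insensitive check for any expected keyword in a chunk."""
    lowered = chunk.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def precision_at_k(chunks: list[str], keywords: list[str], k: int) -> float:
    """Share of the top-k chunks that contain an expected keyword."""
    top = chunks[:k]
    if not top:
        return 0.0
    return sum(contains_keyword(chunk, keywords) for chunk in top) / len(top)


def reciprocal_rank(chunks: list[str], keywords: list[str]) -> float:
    """1 / rank of the first chunk containing a keyword, or 0.0 if none does."""
    for rank, chunk in enumerate(chunks, start=1):
        if contains_keyword(chunk, keywords):
            return 1.0 / rank
    return 0.0


def evaluate(golden_set: list[dict], retrieve: Callable[[str], list[str]], k: int = 5) -> dict:
    """Average Precision@K and MRR over the golden set."""
    if not golden_set:
        raise ValueError("golden_set must not be empty")
    precisions, ranks = [], []
    for item in golden_set:
        chunks = retrieve(item["query"])
        precisions.append(precision_at_k(chunks, item["keywords"], k))
        ranks.append(reciprocal_rank(chunks, item["keywords"]))
    return {"precision_at_k": mean(precisions), "mrr": mean(ranks)}


# Stand-in retriever with canned results, for illustration only.
fake_results = {
    "keyword search": ["Intro text.", "A tsvector column with a GIN index."],
    "vector index": ["HNSW index parameters.", "Unrelated paragraph."],
}
golden = [
    {"query": "keyword search", "keywords": ["tsvector"]},
    {"query": "vector index", "keywords": ["HNSW"]},
]
report = evaluate(golden, lambda query: fake_results[query], k=2)
```

Running the same golden set after each re-ingest gives a like-for-like comparison between chunk sizes, as long as the retriever and K stay fixed.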
Conclusion
Chunking is not an ingest detail — it is an architecture decision. Start with Markdown sections + 2000/200, index embedding and tsvector together, filter noise, measure recall. Only then move to hybrid fusion and rerank.
Technical references
- Chip Huyen, AI Engineering (p. 118) — retrieval quality and chunking as RAG bottleneck.
- Chip Huyen, AI Engineering (p. 285) — indexing and iterative retrieval evaluation.
- Unlocking Data with Generative AI and RAG (p. 307) — semantic units and chunk strategies.
- LLM Design Patterns (p. 414) — RAG with provenance metadata and rerank.