Chunking and Indexing for RAG: Size, Overlap, and Recall

How to set chunk size, overlap, and dual indexing (embedding + tsvector) to maximize recall in production RAG pipelines — with Python code and real tradeoffs.

Renan Moraes

The silent bottleneck in a RAG pipeline, worth fixing before you tune embeddings or rerankers, is how you slice knowledge. Chunks that are too large dilute relevance; chunks that are too small break context. Semantic-only indexing misses exact terms. This article covers structure-aware chunking and dual indexing at ingest.

Why chunking dominates recall

The retriever only finds what was indexed. If the right answer is split across two chunks with no overlap, or buried in a 4,000-token block with TOC noise, the LLM never gets useful context. Chip Huyen stresses in AI Engineering (p. 118) that retrieval quality — and chunking is part of it — caps end-to-end RAG performance.

Chunk size: practical rule

There is no universal magic number. For dense English technical books with ~500 tokens per paragraph:

Strategy     | Chunk size      | When to use
Precision    | 800–1200 chars  | Queries with exact terms, APIs, acronyms
Balanced     | 1500–2000 chars | Tutorials, mixed chapters
Broad recall | 2500–3000 chars | Concepts spanning sections

In production, 2000 chars with 200 overlap is a solid starting point for technical PDFs — aligned with what Unlocking Data with Generative AI and RAG (p. 307) discusses about preserving semantic units.

Overlap: avoid cutting ideas in half

An overlap of 10–15% of the chunk size (200–300 chars for a 2000-char chunk) reduces context loss at boundaries. Without overlap, a definition that ends in chunk N and continues in chunk N+1 is split, and neither chunk may rank well for the full query.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""],
)
chunks = splitter.split_text(markdown_section)

Separator order prioritizes Markdown heading breaks before splitting paragraphs — structure-aware chunking.
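To make the boundary effect concrete, here is a minimal sketch of a fixed-size window with overlap in plain Python. It is an illustration only: sliding_chunks is a hypothetical helper, not part of LangChain, and it ignores separators, which RecursiveCharacterTextSplitter does take into account.

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Cut text into windows of `size` chars; consecutive windows share `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap  # how far the window advances each step
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += stride
    return chunks


if __name__ == "__main__":
    for chunk in sliding_chunks("abcdefghij", size=4, overlap=1):
        print(chunk)
```

With size=4 and overlap=1 the window advances 3 characters at a time, so "abcdefghij" becomes abcd, defg, ghij: the last character of each chunk is repeated at the start of the next. With 2000/200 the window advances 1800 characters per step.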

Section chunking vs. token-blind

Splitting the entire PDF into fixed windows ignores chapter and section boundaries. A better pipeline:

  1. Extract structured Markdown (headings #, ##, ###).
  2. Group by chapter/section.
  3. Sub-split long sections with a length-bounded splitter (character-based in the code here; a token-based length function is an alternative).
  4. Attach metadata: book_title, chapter, page.

This metadata feeds chapter-level reranking and "(p. N)" citations in the generator, a pattern described in LLM Design Patterns (p. 414) for RAG with provenance.

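def build_chunks(pages: list[dict]) -> list[dict]:
    """Split by section; tag each chunk with its section's heading and page."""
    chunks = []
    # detect_sections is assumed to yield dicts with "text", "heading", "page".
    for section in detect_sections(pages):
        for piece in splitter.split_text(section["text"]):
            # Drop fragments under 120 chars: too short to carry useful context.
            if len(piece.strip()) < 120:
                continue
            chunks.append({
                "content": piece,
                "chapter": section["heading"],
                # Every piece inherits the page recorded for its section.
                "page": section["page"],
            })
    return chunks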
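build_chunks relies on a detect_sections helper that is not shown here. The following is one possible sketch, not the original implementation. It assumes each page is a dict with "text" (Markdown) and "page" (page number) keys, and that headings are lines starting with #, ## or ###; both are assumptions made for illustration.

```python
import re

# Markdown heading of level 1-3, e.g. "## Overlap".
HEADING_RE = re.compile(r"^(#{1,3})\s+(.*\S)\s*$")


def detect_sections(pages: list[dict]) -> list[dict]:
    """Group page text into sections; each section keeps the page where it starts."""
    sections = []
    current = None
    for page in pages:
        for line in page["text"].splitlines():
            match = HEADING_RE.match(line)
            if match:
                # A new heading closes the previous section.
                if current is not None:
                    sections.append(current)
                current = {"heading": match.group(2), "page": page["page"], "lines": []}
            else:
                if current is None:
                    # Text before the first heading goes into an untitled section.
                    current = {"heading": "", "page": page["page"], "lines": []}
                current["lines"].append(line)
    if current is not None:
        sections.append(current)
    return [
        {"heading": s["heading"], "page": s["page"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

A section that runs across a page break keeps the page number where its heading appeared, so citations point to the start of the section rather than to the exact page of each chunk.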

Dual indexing at ingest

Each chunk needs two complementary indexes:

  1. Embedding (E5 multilingual, passage: prefix) → HNSW for semantic search.
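  2. tsvector (english config for EN books) → GIN for keyword search.

from sqlalchemy import text


async def upsert_chunk(session, chunk, embedding):
    # Compute the tsvector in PostgreSQL with the 'english' text search configuration.
    tsv = await session.execute(
        text("SELECT to_tsvector('english', :content)"),
        {"content": chunk["content"]},
    )
    # BookChunk is the ORM model for the chunk table (defined elsewhere).
    row = BookChunk(
        content=chunk["content"],
        chapter=chunk["chapter"],
        page=chunk["page"],
        embedding=embedding,
        tsv=tsv.scalar(),
    )
    # The row is only staged here; the caller commits the session.
    session.add(row)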

Without a tsvector stored and indexed at ingest, hybrid search at query time has no keyword index to draw on: you rely only on semantic paraphrase and lose exact matches on acronyms.
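The two ingest-side details named above, the E5 passage: prefix and the HNSW and GIN indexes, can be sketched as small helpers. The table name book_chunks, the index names, and the choice of cosine distance (vector_cosine_ops from pgvector) are assumptions for illustration; adjust them to the real schema.

```python
def e5_passage(text: str) -> str:
    """E5 models expect indexed documents to carry the 'passage: ' prefix."""
    return f"passage: {text}"


def e5_query(text: str) -> str:
    """The matching prefix for search queries at retrieval time."""
    return f"query: {text}"


def index_ddl(table: str = "book_chunks") -> list[str]:
    """DDL for the two indexes; table and column names are assumed, not from the article."""
    return [
        # pgvector HNSW index for approximate nearest-neighbor search (cosine distance).
        f"CREATE INDEX IF NOT EXISTS {table}_embedding_hnsw "
        f"ON {table} USING hnsw (embedding vector_cosine_ops)",
        # GIN index over the stored tsvector column for keyword search.
        f"CREATE INDEX IF NOT EXISTS {table}_tsv_gin "
        f"ON {table} USING gin (tsv)",
    ]
```

Each statement can be run once at migration time, for example with session.execute(text(stmt)). Embedding the prefixed string rather than the raw chunk keeps ingest consistent with queries that use the query: prefix.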

Noise filters before indexing

TOC chunks, index pages, and copyright lines pollute the index. Heuristic gates at ingest (a regex for TOC lines with dot leaders such as "....", and for index lines such as "term, 42") and at runtime (filter_book_noise) prevent garbage from dominating top-k.
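An ingest-time gate along these lines could look like the sketch below. The regexes, the 0.5 threshold, and the name looks_like_noise are assumptions for illustration; this is not the filter_book_noise function mentioned above, and the patterns will need tuning per book.

```python
import re

# "Chunking basics ........ 12": dot leaders followed by a page number.
TOC_LINE = re.compile(r"\.{4,}\s*\d+\s*$")
# "tsvector, 42" or "HNSW, 57, 61-63": short term followed only by page numbers.
INDEX_LINE = re.compile(r"^[^,.]{1,60},\s*\d+(?:\s*[,-]\s*\d+)*\s*$")
# Copyright boilerplate.
COPYRIGHT_LINE = re.compile(r"copyright|©|all rights reserved", re.IGNORECASE)


def looks_like_noise(chunk: str, threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the non-empty lines match a noise pattern."""
    lines = [line.strip() for line in chunk.splitlines() if line.strip()]
    if not lines:
        return True  # nothing worth indexing
    noisy = sum(
        1
        for line in lines
        if TOC_LINE.search(line) or INDEX_LINE.match(line) or COPYRIGHT_LINE.search(line)
    )
    return noisy / len(lines) >= threshold
```

Working on a per-line ratio rather than a single match keeps a chunk that merely quotes one TOC-like line from being discarded.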

Measuring impact

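Build a golden set of technical queries, each paired with the keywords a correct chunk should contain. Metrics:

  • Precision@K (keyword hit rate): does an expected keyword appear in the top-K chunks?
  • MRR: at what rank does the first chunk containing an expected keyword appear, averaged as 1/rank over all queries?
  • Context recall: do terms from the retrieved chunks appear in the generated draft?

Change chunk size by ±500 chars, re-ingest, and compare MRR on the same golden set. Chip Huyen (p. 285) treats indexing and retrieval evaluation as an iterative cycle, not a one-shot exercise.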
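The keyword check and the MRR comparison can be scripted in a few lines. This is a minimal sketch under stated assumptions: each golden item is a dict with "query" and "keywords" keys, retrieve is any callable that returns ranked chunk texts, and a hit means any expected keyword appears as a case-insensitive substring. None of these names come from the original pipeline.

```python
from typing import Callable, Optional


def first_hit_rank(chunks: list[str], keywords: list[str], k: int) -> Optional[int]:
    """1-based rank of the first top-k chunk containing any keyword, else None."""
    lowered = [kw.lower() for kw in keywords]
    for rank, chunk in enumerate(chunks[:k], start=1):
        text = chunk.lower()
        if any(kw in text for kw in lowered):
            return rank
    return None


def evaluate(golden: list[dict], retrieve: Callable[[str], list[str]], k: int = 5) -> dict:
    """Keyword hit rate and MRR at k over a golden set."""
    ranks = [first_hit_rank(retrieve(item["query"]), item["keywords"], k) for item in golden]
    total = len(ranks)
    if total == 0:
        return {"hit_rate": 0.0, "mrr": 0.0}
    hits = [rank for rank in ranks if rank is not None]
    return {
        "hit_rate": len(hits) / total,               # share of queries with a hit in top-k
        "mrr": sum(1.0 / rank for rank in hits) / total,  # misses contribute 0
    }
```

Running evaluate once per chunk-size setting, against the same golden set, gives directly comparable numbers for each re-ingest.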

Conclusion

Chunking is not an ingest detail — it is an architecture decision. Start with Markdown sections + 2000/200, index embedding and tsvector together, filter noise, measure recall. Only then move to hybrid fusion and rerank.

Technical references

  • Chip Huyen, AI Engineering (p. 118) — retrieval quality and chunking as RAG bottleneck.
  • Chip Huyen, AI Engineering (p. 285) — indexing and iterative retrieval evaluation.
  • Unlocking Data with Generative AI and RAG (p. 307) — semantic units and chunk strategies.
  • LLM Design Patterns (p. 414) — RAG with provenance metadata and rerank.