Hybrid RAG in Practice: Semantic + Keyword Search

How to combine vector search and keyword retrieval for precise technical context in production RAG pipelines.

Renan Moraes

Hybrid RAG in Practice: Semantic + Keyword Search

Retrieval-Augmented Generation (RAG) stops being "embedding-only" when you need precision on technical terms. Pure semantic search captures paraphrases but may miss acronyms, API names, and rare concepts. The pragmatic fix is hybrid fusion: cosine similarity candidates (HNSW) + full-text search candidates (GIN/tsvector), then rerank.

Why hybrid?

In production, queries mix natural language with jargon. A user asks "how does reranking work in RAG?" — embeddings capture intent, but tsvector ensures chunks with literal "rerank" rise in the ranking. As Chip Huyen explains in AI Engineering (p. 112): retrieval quality dominates end-to-end RAG performance.

Minimal architecture

  1. Query embedding with E5 prefix (query: ...) for semantic search.
  2. tsquery in parallel over tsvector indexed at ingest.
  3. Normalized fusion: score = w_sem * cos_norm + w_kw * kw_norm.
  4. Heuristic rerank by chapter and diversity.
async def hybrid_search(session, query_vec, tsquery, top_k=6):
    sem = await semantic_candidates(session, query_vec, fetch_k=30)
    kw = await keyword_candidates(session, tsquery, fetch_k=30)
    merged = fuse_normalized(sem, kw, w_sem=0.65, w_kw=0.35)
    return rerank_by_chapter(merged)[:top_k]

Grounding in the generator

The LLM must only cite retrieved chunks. Post-generation validation: do EN→PT terms from books appear in the text? Are there page citations? Without this, the model invents generic "retriever + generator" definitions.

def validate_grounding(draft: str, book_chunks: list) -> bool:
    hits = sum(1 for c in book_chunks if chunk_has_keyword_hit(draft, c))
    return hits >= min(2, len(book_chunks))

Real tradeoffs

Approach Pros Cons
Semantic only High recall on paraphrase Misses exact terms
Keyword only Precision on acronyms Fragile to synonyms
Hybrid Best of both More tuning (weights, gates)

Increase w_kw for technical queries (≥4 informative tokens). Gates prevent weak keyword-only hits from dominating.

Conclusion

Hybrid RAG is not over-engineering — it is the default when technical books have dense vocabulary and queries mix Portuguese with English terms. Start with simple fusion, measure recall on a golden set, then refine weights.

Technical references

  • Chip Huyen, AI Engineering (p. 112) — retrieval quality as RAG bottleneck.
  • Chip Huyen, AI Engineering (p. 118) — chunking and indexing for recall.
  • Lewis et al., Retrieval-Augmented Generation (p. 2) — retriever + generator architecture.