Hybrid RAG in Practice: Semantic + Keyword Search
How to combine vector search and keyword retrieval for precise technical context in production RAG pipelines.
Hybrid RAG in Practice: Semantic + Keyword Search
Retrieval-Augmented Generation (RAG) stops being "embedding-only" when you need precision on technical terms. Pure semantic search captures paraphrases but may miss acronyms, API names, and rare concepts. The pragmatic fix is hybrid fusion: cosine similarity candidates (HNSW) + full-text search candidates (GIN/tsvector), then rerank.
Why hybrid?
In production, queries mix natural language with jargon. A user asks "how does reranking work in RAG?" — embeddings capture intent, but tsvector ensures chunks with literal "rerank" rise in the ranking. As Chip Huyen explains in AI Engineering (p. 112): retrieval quality dominates end-to-end RAG performance.
Minimal architecture
- Query embedding with E5 prefix (
query: ...) for semantic search. - tsquery in parallel over tsvector indexed at ingest.
- Normalized fusion:
score = w_sem * cos_norm + w_kw * kw_norm. - Heuristic rerank by chapter and diversity.
async def hybrid_search(session, query_vec, tsquery, top_k=6):
sem = await semantic_candidates(session, query_vec, fetch_k=30)
kw = await keyword_candidates(session, tsquery, fetch_k=30)
merged = fuse_normalized(sem, kw, w_sem=0.65, w_kw=0.35)
return rerank_by_chapter(merged)[:top_k]
Grounding in the generator
The LLM must only cite retrieved chunks. Post-generation validation: do EN→PT terms from books appear in the text? Are there page citations? Without this, the model invents generic "retriever + generator" definitions.
def validate_grounding(draft: str, book_chunks: list) -> bool:
hits = sum(1 for c in book_chunks if chunk_has_keyword_hit(draft, c))
return hits >= min(2, len(book_chunks))
Real tradeoffs
| Approach | Pros | Cons |
|---|---|---|
| Semantic only | High recall on paraphrase | Misses exact terms |
| Keyword only | Precision on acronyms | Fragile to synonyms |
| Hybrid | Best of both | More tuning (weights, gates) |
Increase w_kw for technical queries (≥4 informative tokens). Gates prevent weak keyword-only hits from dominating.
Conclusion
Hybrid RAG is not over-engineering — it is the default when technical books have dense vocabulary and queries mix Portuguese with English terms. Start with simple fusion, measure recall on a golden set, then refine weights.
Technical references
- Chip Huyen, AI Engineering (p. 112) — retrieval quality as RAG bottleneck.
- Chip Huyen, AI Engineering (p. 118) — chunking and indexing for recall.
- Lewis et al., Retrieval-Augmented Generation (p. 2) — retriever + generator architecture.