
MH (Frank) Tsai

AI Solutions Architect

RAG with Hybrid Search in Production: Lessons from Serving Millions of Queries

We shipped a retrieval-augmented generation system that serves millions of search queries across multiple channels. The first version used pure vector search. It was elegant, easy to reason about — and consistently returned wrong results for 30% of queries. This post covers what we learned building a hybrid search system that actually works in production.

Why Pure Vector Search Wasn't Enough

Vector search is great at semantic similarity. Ask "cozy Italian place for a date" and it surfaces restaurants described as "romantic trattoria with candlelit ambiance." But it fails at things BM25 handles trivially:

  • Exact name matches. Searching for "Da Wan" (a specific restaurant name) returned semantically similar but wrong results because the embedding space doesn't privilege exact string matches.
  • Attribute filtering. "Restaurants open on Monday" isn't a semantic concept — it's a structured data query. Vector search treated it as vibes.
  • Rare terms. Niche cuisine types or neighborhood names have weak embedding representations. BM25's term frequency approach handles them better.

The failure pattern was predictable: vector search worked well for exploratory queries ("somewhere nice for brunch") but failed for precise ones ("Tsuta ramen in Xinyi district").

The Hybrid Architecture

We settled on Elasticsearch with both BM25 and dense vector fields:

User Query
    │
    ├──> BM25 Search (text fields: name, description, reviews, menu)
    │         │
    │         ▼
    │    BM25 Results + Scores
    │
    ├──> Vector Search (embedding field: 768-dim dense vector)
    │         │
    │         ▼
    │    Vector Results + Scores
    │
    └──> Score Fusion Layer
              │
              ▼
         Merged & Reranked Results
              │
              ▼
         LLM Generation (with top-k context)

Each document in the index carries both traditional text fields (analyzed with custom tokenizers for CJK and English) and a dense vector field storing the document embedding.
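A minimal sketch of what such a mapping might look like. The field names mirror the diagram above; the custom CJK/English analyzer configuration is elided, and the dense-vector options follow Elasticsearch 8.x syntax:

```python
# Hypothetical index mapping: BM25-analyzed text fields plus a dense
# vector field on the same document. Analyzer config for CJK is elided.
index_mapping = {
    "mappings": {
        "properties": {
            "name":        {"type": "text"},
            "description": {"type": "text"},
            "reviews":     {"type": "text"},
            "menu":        {"type": "text"},
            # 768-dim embedding, indexed for approximate kNN (ES 8.x)
            "embedding": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}
```

With a mapping like this, one index serves both retrieval paths: a `match` query hits the analyzed text fields, and a `knn` clause searches the `embedding` field.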

Chunking Decisions

The chunking strategy took three iterations to get right:

V1: Full document embeddings. Each restaurant got one embedding from its entire description + reviews. Problem: long documents diluted the signal. A restaurant with 50 reviews about service quality and one review about their signature dish would have an embedding dominated by "service," making it invisible for dish-specific queries.

V2: Fixed-size chunks (512 tokens). Better retrieval precision, but chunks broke mid-sentence and lost context. A review saying "the pasta was incredible but the wait was terrible" might get split, and the chunk with "the wait was terrible" would surface for negative sentiment queries about the wrong aspect.

V3: Semantic chunking with metadata (current). We chunk by logical unit — each review is one chunk, menu sections are separate chunks, the venue description is one chunk. Each chunk carries parent document metadata (restaurant name, location, cuisine type) so the LLM always has context even from a partial match. This increased our index size by ~40% but improved retrieval precision significantly.
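As an illustration, the V3 strategy could look something like this. The record shape and field names are invented for the example; the point is that every chunk inherits its parent's metadata:

```python
def chunk_restaurant(doc: dict) -> list[dict]:
    """Split one restaurant record into logical chunks (V3 strategy).

    Each chunk carries parent-document metadata so the LLM always
    has context, even when only a partial match is retrieved.
    """
    parent_meta = {
        "restaurant_name": doc["name"],
        "location": doc["location"],
        "cuisine_type": doc["cuisine"],
    }
    chunks = []
    # The venue description is one chunk.
    chunks.append({"text": doc["description"], "kind": "description", **parent_meta})
    # Each review is its own chunk.
    for review in doc.get("reviews", []):
        chunks.append({"text": review, "kind": "review", **parent_meta})
    # Each menu section is a separate chunk.
    for section, items in doc.get("menu", {}).items():
        chunks.append(
            {"text": f"{section}: {', '.join(items)}", "kind": "menu", **parent_meta}
        )
    return chunks
```

A review chunk that matches "signature dish" now arrives at the LLM already tagged with the restaurant's name and location, which is what makes partial matches usable.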

Score Fusion: Reciprocal Rank Fusion vs. Weighted Combination

We tested two fusion strategies:

Weighted linear combination: final_score = alpha * bm25_score + (1 - alpha) * vector_score. Simple, but BM25 and vector scores live on different scales. Normalizing them introduced its own problems — a BM25 score of 15.3 and a vector cosine similarity of 0.87 don't map to the same "confidence" even after min-max normalization.

Reciprocal Rank Fusion (RRF): RRF_score = sum(1 / (k + rank_i)) across both result lists. This approach only cares about ranking position, not raw scores. A document ranked #1 by BM25 and #5 by vector search gets a higher RRF score than one ranked #3 by both.

We went with RRF (k=60) because:

  • It's score-agnostic — no normalization gymnastics
  • It naturally handles the case where one retriever doesn't return a document at all
  • It's deterministic and easy to debug: you can trace exactly which ranker contributed what
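RRF itself is only a few lines. A sketch, with document IDs standing in for full search hits:

```python
from collections import defaultdict

def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score = sum over lists of 1 / (k + rank),
    with 1-based ranks. A document missing from one list simply
    contributes nothing from it, so no score normalization is needed."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With k=60, a document ranked #1 and #5 scores 1/61 + 1/65 ≈ 0.0318, narrowly beating one ranked #3 in both lists (2/63 ≈ 0.0317), matching the example above.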

The Reranking Layer

After fusion, we pass the top-20 results through a cross-encoder reranker. This is the most expensive step (~150ms) but delivers the biggest quality improvement.

The reranker sees the full query alongside each candidate document and produces a relevance score that captures nuances the bi-encoder embeddings miss. For example, "quiet place for working" — the bi-encoder might match "quiet restaurant" and "co-working space" equally, but the cross-encoder understands the user wants a quiet restaurant suitable for working.

Latency budget:

| Step | p50 | p95 |
|------|-----|-----|
| BM25 retrieval | 8ms | 25ms |
| Vector retrieval | 12ms | 40ms |
| RRF fusion | 1ms | 2ms |
| Cross-encoder rerank (top-20) | 80ms | 150ms |
| Total retrieval | ~100ms | ~220ms |

The reranker is the bottleneck. We considered cutting it but A/B testing showed a 15% improvement in answer relevance with it enabled. We kept it and optimized by limiting reranking to 20 candidates instead of 50.
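The reranking step, with the scoring model abstracted behind a callable. In a real stack this would be a cross-encoder's predict call over (query, document) pairs; here a plain function stands in so the shape is clear:

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 20,
) -> list[str]:
    """Score each (query, candidate) pair with a cross-encoder-style
    function and return candidates sorted by relevance. Only the first
    top_k candidates are scored, capping the cost of the expensive step."""
    pool = candidates[:top_k]
    scored = [(score_fn(query, doc), doc) for doc in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```

Because the scorer sees the full query and document together, it can make the "quiet restaurant" vs. "co-working space" distinction that independently-computed bi-encoder embeddings blur.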

Embedding Pipeline

Documents flow through an async embedding pipeline:

  1. Data ingestion — new or updated restaurant data arrives via webhook or batch sync
  2. Preprocessing — text cleaning, language detection, CJK tokenization
  3. Embedding generation — batched calls to the embedding model (we use a 768-dim model)
  4. Index update — Elasticsearch bulk upsert with both text and vector fields
  5. Validation — spot-check retrieval quality on a held-out query set

The pipeline runs on Pub/Sub — each data change event triggers an embedding job. Average end-to-end latency from data change to searchable embedding: ~45 seconds.
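Step 3 groups documents before calling the embedding model; a generic batching helper (the batch size of 64 is illustrative, not a figure from our pipeline):

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(docs: Iterable[str], size: int = 64) -> Iterator[list[str]]:
    """Yield successive fixed-size batches of documents for the
    embedding call. Batching amortizes per-request overhead
    against the embedding endpoint."""
    it = iter(docs)
    while batch := list(islice(it, size)):
        yield batch
```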

Evaluation with Promptfoo

We built a retrieval evaluation suite using promptfoo with three test dimensions:

  1. Retrieval precision. Given a known-good query-document pair, does the system retrieve that document in the top-5? We maintain a golden dataset of 200+ query-document pairs, updated monthly.

  2. Answer faithfulness. Does the LLM's generated answer actually reflect the retrieved context? We check for hallucination by verifying claims against source documents.

  3. Negative testing. When the knowledge base doesn't contain an answer, does the system say "I don't know" instead of hallucinating? We maintain a set of unanswerable queries that should trigger abstention.

Running this suite on every prompt change catches regressions before they hit production. We've caught 12 regressions in the past year that would have shipped without it.
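The first check, retrieval precision against the golden set, reduces to a hit-rate metric. A sketch, with golden pairs modeled as (query, expected_doc_id) and the retriever abstracted as a function:

```python
from typing import Callable

def precision_at_k(
    golden: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of golden (query, expected_doc_id) pairs for which the
    expected document appears in the top-k retrieved results."""
    if not golden:
        return 0.0
    hits = sum(1 for query, expected in golden if expected in retrieve(query)[:k])
    return hits / len(golden)
```

Tracking this number per release makes retrieval regressions visible independently of any change in the generation layer.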

Lessons Learned

  1. Hybrid search isn't optional for production RAG. Pure vector or pure keyword will fail for significant query segments. The question isn't whether to go hybrid, but how to fuse the signals.

  2. Chunking is the highest-leverage decision. We spent more time on chunking strategy than on model selection. Bad chunks mean even perfect retrieval returns useless context.

  3. Measure retrieval independently from generation. When answer quality drops, you need to know if it's a retrieval problem or a generation problem. We log retrieved documents separately from generated answers.

  4. Cache aggressively. Popular queries repeat. We cache the full retrieval result (not just embeddings) with a 15-minute TTL in Redis. Cache hit rate: ~35%, which meaningfully reduces both latency and cost.

  5. Start with BM25 and add vectors, not the other way around. BM25 is fast, debuggable, and handles most queries adequately. Add vector search for the semantic queries that BM25 misses. This incremental approach lets you measure the actual value vectors add.
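The cache in lesson 4 can be illustrated with a minimal in-process TTL cache. Our production version is Redis keyed on the query with a 15-minute expiry; this dict-based stand-in just shows the shape:

```python
import time

class TTLCache:
    """Minimal TTL cache for full retrieval results (Redis stand-in).
    Keys are query strings; values are ranked result lists."""

    def __init__(self, ttl_seconds: float = 900):  # 15-minute TTL
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key: str, value) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)
```

Caching the full retrieval result, rather than just embeddings, means a cache hit skips BM25, vector search, fusion, and the expensive reranker in one go.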

2026 ❤️ MH (Frank) Tsai