Part IV · Information Retrieval & RAG
Chapter 60 ~13 min read

Hybrid search and fusion

"Lexical search misses the questions where users don't know the right words. Dense search misses the questions where users do. Use both."

We covered BM25 in Chapter 57 and dense retrieval in Chapters 56-57. We made the case that they’re complementary. This chapter is about how to actually combine them: the algorithms, the trade-offs, and the production patterns.

By the end you’ll know how to design a hybrid search system, when to use each fusion strategy, and what the empirical results look like.

Outline:

  1. The complementarity argument.
  2. Sequential vs parallel hybrid.
  3. Reciprocal Rank Fusion (RRF).
  4. Weighted score fusion.
  5. Learning to rank.
  6. SPLADE and other learned sparse retrievers.
  7. The empirical results.
  8. Production architectures.
  9. The decision matrix.

60.1 The complementarity argument

A reminder of the case for hybrid search:

BM25 catches:

  • Exact term matches on rare/specific words.
  • Named entities, product names, technical terms.
  • Code, IDs, structured tokens.
  • Acronyms and abbreviations.
  • Queries where the user wants specific words matched.

Dense retrieval catches:

  • Synonyms and paraphrases.
  • Semantic concepts not literally in the query.
  • Cross-lingual matches.
  • Conversational queries that share no words with the relevant documents.

These are disjoint failure modes. A document that’s relevant via exact term match (BM25 strong) might not be relevant via semantic similarity (dense weak). A document relevant via paraphrase (dense strong) might share no words with the query (BM25 weak).

The empirical observation: on diverse test sets, BM25 and dense retrievers disagree on a substantial fraction of the top-K results. Each finds documents the other misses. Combining them gives a union that’s closer to the true relevance ranking than either alone.

The improvement is typically 5-15 points on retrieval benchmarks (nDCG@10) — a large gain in IR terms. This is why modern RAG systems use hybrid retrieval as the default.

[Figure: Venn diagram. BM25-only results (exact matches, names, codes), dense-only results (synonyms, paraphrases), and the overlap both find; hybrid (RRF) returns the union, ranked by agreement.]
BM25 and dense retrieval find largely disjoint sets of relevant documents — the union is what hybrid delivers, and documents appearing in both get a ranking boost from RRF.

60.2 Sequential vs parallel hybrid

Two ways to structure hybrid retrieval:

Sequential: BM25 first, then dense retrieval over the BM25 candidates (or vice versa).

[Query]
  |
  v
[BM25] → top 1000 candidates
  |
  v
[Dense rerank] → top 100
  |
  v
[Cross-encoder rerank] → top 10

The advantage: each stage narrows the candidate set, so the expensive operations (dense, cross-encoder) only run on a few candidates. The disadvantage: if BM25 misses a relevant document, dense retrieval can’t recover it.

Parallel: BM25 and dense retrieval run independently, and the results are merged.

[Query]
  |
  +--> [BM25] → top 100
  |
  +--> [Dense] → top 100
  |
  v
[Fusion] → top 100 combined
  |
  v
[Cross-encoder rerank] → top 10

The advantage: each retriever sees the full corpus, so neither misses what the other catches. The disadvantage: more compute (two retrievers instead of one) and more complexity (the fusion step).

For most production RAG, parallel hybrid retrieval is the right pattern. The compute overhead is small compared to the LLM cost downstream, and the recall improvement is worth it.

The fusion step (next sections) is where the algorithms get interesting.

60.3 Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion (RRF) is the simplest and most popular fusion algorithm. The idea: combine retrievers based on the rank of each document in each retriever’s output, not the score.

The formula: for each candidate document, the RRF score is:

RRF_score(d) = Σ_{retriever r} 1 / (k + rank_r(d))

Where rank_r(d) is the document’s rank in retriever r’s output (1 for the top document, 2 for the next, etc.), and k is a constant (typically 60).

If a document doesn’t appear in retriever r’s top-K, its rank is treated as infinity (the term is 0).

Concretely, suppose BM25 ranks document A as #1 and document B as #5. Dense ranks document A as #4 and document B as #2. With k = 60:

RRF_score(A) = 1/(60+1) + 1/(60+4) ≈ 0.0164 + 0.0156 ≈ 0.0320
RRF_score(B) = 1/(60+5) + 1/(60+2) ≈ 0.0154 + 0.0161 ≈ 0.0315

So A scores slightly higher than B, even though B was higher in dense retrieval. The fusion picks A because it’s high in both, while B is high in only one.
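The formula is a few lines of Python. A minimal sketch, using the document IDs from the worked example above (the extra documents C–G are illustrative placeholders):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs via Reciprocal Rank Fusion.

    rankings: one ranked list of doc IDs per retriever, best first.
    A document absent from a retriever's list contributes 0 for it.
    """
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# The worked example: A is #1 in BM25 and #4 in dense; B is #5 and #2.
bm25 = ["A", "C", "E", "F", "B"]
dense = ["D", "B", "C", "A", "G"]
fused = rrf_fuse([bm25, dense])
# A edges out B and C because it ranks high in both retrievers.
```

Note that the function never looks at a raw score, only at list positions, which is exactly what makes RRF scale-agnostic.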

[Figure: RRF score computation, k = 60. BM25 ranks: #1 A, #2 C, #3 E, #4 F, #5 B. Dense ranks: #1 D, #2 B, #3 C, #4 A, #5 G. RRF scores: A = 1/61 + 1/64 ≈ 0.0320; C = 1/62 + 1/63 ≈ 0.0320; B = 1/65 + 1/62 ≈ 0.0315; D = 0 + 1/61 ≈ 0.0164. Rank-based, not score-based, so the fusion is scale-agnostic.]
RRF uses only the rank position from each retriever — different score scales don't matter — and rewards documents that appear high in multiple retrievers.

Why RRF works so well:

  • Score-agnostic. It doesn’t matter that BM25 scores are in [0, 50] and dense scores are in [-1, 1]. RRF only uses ranks.
  • Robust. Essentially nothing to tune: the single constant k = 60 from the original paper works almost universally.
  • Simple. A few lines of code.
  • Empirically strong. Beats more complex fusion methods on most benchmarks.

RRF is the default hybrid fusion algorithm in production. Elasticsearch, OpenSearch, Vespa, Weaviate, and many vector databases all support it natively.

The empirical result: RRF with BM25 + dense gives 5-15 points of nDCG@10 improvement over either retriever alone. The improvement is consistent across benchmarks.

For most production RAG, start with RRF and don’t tune unless you have a reason. It’s hard to beat.

60.4 Weighted score fusion

A different approach: normalize the scores from each retriever and combine them with weights:

fused_score(d) = w_BM25 × normalized_BM25(d) + w_dense × normalized_dense(d)

The challenge is the normalization. BM25 and dense scores live in completely different ranges. You have to:

  1. Normalize each retriever’s scores (typically to [0, 1] via min-max or z-score normalization).
  2. Pick weights for the linear combination.

The advantage over RRF: more granular. You can tune the weights to your specific workload.

The disadvantage: tuning is hard. The right weights depend on the corpus and the query distribution. Default weights of 0.5 / 0.5 work, but they’re not optimal.

Weighted score fusion is more common in research papers than in production. RRF is simpler and works almost as well.
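The two steps above can be sketched directly; the scores and the 0.5 / 0.5 weights here are illustrative:

```python
def min_max(scores):
    """Normalize a {doc_id: raw_score} map to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # degenerate case: all scores equal
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_fuse(bm25_scores, dense_scores, w_bm25=0.5, w_dense=0.5):
    """Linear combination of normalized scores; missing docs count as 0."""
    nb, nd = min_max(bm25_scores), min_max(dense_scores)
    docs = set(nb) | set(nd)
    fused = {d: w_bm25 * nb.get(d, 0.0) + w_dense * nd.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# BM25 scores in [0, 50], dense cosine scores in [-1, 1]: different scales,
# which is why the normalization step is mandatory here but absent from RRF.
bm25 = {"A": 42.0, "B": 12.0, "C": 30.0}
dense = {"A": 0.5, "B": 0.85, "D": 0.2}
fused = weighted_fuse(bm25, dense)
```

The choice of min-max vs z-score normalization, and of the weights, is exactly the tuning burden the section describes; RRF avoids it by discarding scores entirely.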

60.5 Learning to rank

The most sophisticated fusion: train a model to combine retriever scores. The model takes (query, document, retriever_scores) as input and outputs a relevance score. The fusion is whatever the model learned.

The training data: (query, document, relevance_label) triples, with retriever scores computed as features. Train a gradient boosting model (XGBoost, LightGBM) or a small neural network to predict the relevance label from the features.

The features can include:

  • BM25 score.
  • Dense retrieval score.
  • Reranker score (if available).
  • Click-through rate (if you have user data).
  • Document age.
  • Document length.
  • Custom features.

Learning to rank can outperform RRF by another 2-5 points if you have:

  • A large amount of training data with relevance labels.
  • A diverse query distribution.
  • Domain-specific signals worth incorporating.

For most teams, the cost of getting training data is too high and RRF is good enough. Learning to rank is for teams with mature retrieval infrastructure and lots of user feedback data (e.g., search engines, e-commerce).
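To make the idea concrete, here is a toy pointwise ranker: a hand-rolled logistic regression over retriever-score features. Production learning to rank would use XGBoost or LightGBM with pairwise or listwise objectives, as noted above; the features and training rows here are made up.

```python
import math

def train_pointwise_ltr(rows, lr=0.1, epochs=500):
    """Fit a tiny logistic regression on (features, relevance) rows.

    rows: list of (feature_vector, label) with label in {0, 1}.
    Returns learned weights; the last entry is the bias term.
    """
    dim = len(rows[0][0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        for x, y in rows:
            z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of log loss w.r.t. z
            for i, xi in enumerate(x):
                w[i] -= lr * g * xi
            w[-1] -= lr * g
    return w

def score(w, x):
    """Relevance score for a candidate's feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]

# Toy training set: features are (normalized BM25 score, dense score),
# labels are binary relevance judgments.
rows = [((0.9, 0.8), 1), ((0.8, 0.1), 1), ((0.1, 0.9), 1),
        ((0.1, 0.1), 0), ((0.2, 0.3), 0), ((0.3, 0.2), 0)]
w = train_pointwise_ltr(rows)
```

At query time, candidates from the fused pool are scored with the trained model and re-sorted; the extra features listed above (CTR, document age, etc.) would simply widen the feature vector.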

60.6 SPLADE and other learned sparse retrievers

A different angle on hybrid: instead of running BM25 and dense as separate retrievers, train a single retriever that produces sparse vectors that capture both lexical and semantic signals.

SPLADE (Sparse Lexical and Expansion model, Formal et al., 2021) is the canonical example. SPLADE takes a query/document and produces a sparse vector over the vocabulary, where each entry is the importance of that term (potentially expanded with synonyms learned by the model).

For example, the query “What is the capital of France?” might produce a sparse vector with high weights on:

  • “capital” (the original term)
  • “France” (the original term)
  • “Paris” (the answer, which the model adds)
  • “city” (a related concept)
  • The original query terms keep the highest weights; the expansions get smaller ones.

The retrieval is then: search the corpus’s sparse vectors for the highest dot-product matches. This is a lexical-style search (using inverted indices) but with learned term weights and term expansion, making it somewhere between BM25 and dense retrieval.
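A minimal sketch of that retrieval step. The sparse vectors here are hand-written stand-ins for what a trained SPLADE model would output; only the index-and-score mechanics are the point:

```python
from collections import defaultdict

# Hand-written sparse vectors standing in for a trained SPLADE encoder's
# output: {term: learned importance weight}, including expansion terms.
docs = {
    "d1": {"paris": 1.4, "capital": 1.1, "france": 1.3, "city": 0.6},
    "d2": {"lyon": 1.2, "france": 1.0, "city": 0.7},
}

# Inverted index over the sparse vectors: term -> [(doc_id, weight), ...].
# This is the same structure a BM25 index uses, just with learned weights.
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, weight in vec.items():
        index[term].append((doc_id, weight))

def sparse_search(query_vec, top_k=10):
    """Score documents by sparse dot product via the inverted index."""
    scores = defaultdict(float)
    for term, q_w in query_vec.items():
        for doc_id, d_w in index[term]:
            scores[doc_id] += q_w * d_w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# The chapter's example query, expanded the way the model would:
# "Paris" and "city" appear even though the user never typed them.
query = {"capital": 1.0, "france": 1.0, "paris": 0.8, "city": 0.3}
results = sparse_search(query)
```

Because scoring only touches the posting lists of the query's nonzero terms, the whole thing runs on ordinary inverted-index machinery.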

SPLADE-style retrievers have several advantages:

  • One index (not two). Simpler operationally.
  • Can use existing inverted-index infrastructure (Elasticsearch, etc.) — they just use sparse vectors instead of plain term frequencies.
  • Strong recall, often beating both BM25 and dense alone.

The disadvantages:

  • Slower than BM25 at indexing time (requires running the model on every document).
  • Larger index than BM25 (more terms per document due to expansion).
  • Less mature ecosystem than BM25 or dense retrieval.

SPLADE is gaining adoption but is still less common than BM25+dense hybrid. For production teams that want simplicity and good recall, BGE-M3 (which produces dense + sparse + multi-vector outputs from a single model) is an interesting option that combines multiple retrieval types in one model.

60.7 The empirical results

Concrete numbers for hybrid retrieval improvement on standard benchmarks:

MS MARCO (the canonical retrieval benchmark):

  • BM25: ~22 nDCG@10
  • Dense (BGE-large): ~38
  • Hybrid (RRF): ~42

BEIR (a diverse retrieval benchmark):

  • BM25: ~42 nDCG@10 (averaged across 18 datasets)
  • Dense (BGE-large): ~50
  • Hybrid (RRF): ~56

HotpotQA (multi-hop questions):

  • BM25: ~57 nDCG@10
  • Dense: ~60
  • Hybrid: ~64

The pattern: hybrid consistently beats either alone, by 5-15 points. The improvement is largest on benchmarks where the queries are diverse and unpredictable (BEIR average) and smaller on benchmarks where one approach is well-suited (HotpotQA, where dense already does well).

For production deployments, expect a 5-10 point improvement from going hybrid. This is large enough to be worth the operational complexity.

60.8 Production architectures

The standard production hybrid retrieval architecture:

[Query]
  |
  +----------------------+
  |                      |
  v                      v
[BM25]             [Dense retriever]
(Elastic)          (TEI / OpenAI / etc.)
  |                      |
  | top 100              | top 100
  |                      |
  +----------+-----------+
             v
      [Fusion (RRF)]
             |
      top 100 candidates
             |
             v
  [Cross-encoder reranker]
   (TEI with bge-reranker)
             |
       top 10 results
             |
             v
    [LLM with context]

The components:

  • BM25 retriever: Elasticsearch, OpenSearch, or a vector DB with BM25 support.
  • Dense retriever: a vector database (Qdrant, Weaviate, Pinecone) with embeddings from a TEI service.
  • Fusion: RRF, computed in application code or by the retrieval layer if it supports it.
  • Cross-encoder reranker: a separate TEI service running a reranker model.
  • LLM: vLLM serving the generation model.

Parallel hybrid architecture: both retrievers see the full corpus independently; RRF fusion rewards documents that rank high in both; the cross-encoder makes the final precision cut before the LLM.

This pipeline takes a query, retrieves ~200 candidates (100 from each retriever), fuses them via RRF, reranks the top 100 with a cross-encoder, and feeds the top 10 to the LLM. The total latency is typically 200-500 ms for the retrieval portion, with the LLM adding the rest.
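The glue code for this pipeline is short. A sketch with stub retrievers and a stub reranker standing in for Elasticsearch, Qdrant/TEI, and the cross-encoder service; all names and outputs below are illustrative, not real client APIs:

```python
from collections import defaultdict

def bm25_search(query, k):   # stand-in for an Elasticsearch query
    return [f"doc{i}" for i in range(k)]

def dense_search(query, k):  # stand-in for Qdrant over TEI embeddings
    return [f"doc{i}" for i in range(k // 2, k // 2 + k)]

def rerank(query, doc_ids, k):  # stand-in for a cross-encoder service
    return doc_ids[:k]          # stub: keeps the fused order

def hybrid_retrieve(query, fetch_k=100, rerank_k=100, final_k=10, rrf_k=60):
    """Parallel hybrid retrieval: fetch from both, RRF-fuse, rerank."""
    fused = defaultdict(float)
    for ranked in (bm25_search(query, fetch_k), dense_search(query, fetch_k)):
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] += 1.0 / (rrf_k + rank)
    candidates = [d for d, _ in sorted(fused.items(),
                                       key=lambda kv: kv[1], reverse=True)]
    return rerank(query, candidates[:rerank_k], final_k)

results = hybrid_retrieve("what is hybrid search?")
```

With these stubs, docs 50-99 appear in both retrievers' lists, so RRF pushes them to the top of the fused ranking, which is the behavior the architecture is built around.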

A simpler version skips the cross-encoder reranker for latency-critical applications. A more sophisticated version adds query rewriting (Chapter 63) before retrieval.

60.9 The decision matrix

When to use which fusion approach:

Use case                            Recommended
Production RAG, default             RRF with BM25 + dense
You have lots of training data      Learning to rank
You want a single retriever         SPLADE or BGE-M3
You’re optimizing for simplicity    Dense only (skip BM25)
You’re optimizing for cost          BM25 only (skip dense)
You’re at frontier scale            All of the above with custom tuning

For most production RAG, RRF with BM25 + dense + cross-encoder reranking is the right architecture. Start there; only add complexity if your evaluation says you need it.

60.10 The mental model

Eight points to take into Chapter 61:

  1. BM25 and dense retrieval are complementary. They catch different things.
  2. Parallel hybrid (run both, fuse the results) beats sequential hybrid.
  3. RRF (Reciprocal Rank Fusion) is the simplest and most-used fusion. Score-agnostic, robust, hard to beat.
  4. Weighted score fusion is more granular but harder to tune.
  5. Learning to rank is the most powerful but requires training data.
  6. SPLADE-style learned sparse retrievers combine lexical and semantic signals in one index.
  7. Empirical improvement from hybrid is 5-15 points on standard benchmarks.
  8. Production default: BM25 + dense + RRF + cross-encoder reranker. Then LLM.

In Chapter 61 we look at the most underrated lever in RAG quality: chunking strategies.


Read it yourself

  • Cormack et al., Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods (2009). The original RRF paper.
  • The Elasticsearch RRF documentation.
  • Formal et al., SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking (2021).
  • The BGE-M3 paper (2024) for the multi-functionality approach.
  • Thakur et al., BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models (2021).

Practice

  1. Implement RRF in 20 lines of Python. Test it on two retrievers’ outputs.
  2. Why is RRF score-agnostic? What’s the practical advantage over weighted fusion?
  3. Construct a query where BM25 catches the right answer and dense doesn’t, and vice versa.
  4. For a query that’s a 4-word product name, which retriever is more likely to win? What if it’s a 20-word natural language question?
  5. Why does learning to rank usually require user click-through data? Could you train it on synthetic data?
  6. Read the SPLADE paper. Explain how it produces sparse vectors and why they capture both lexical and semantic.
  7. Stretch: Set up a hybrid retrieval pipeline with Elasticsearch (BM25) + Qdrant (dense) + RRF on a small corpus. Compare to either retriever alone on a few test queries.