Hybrid Search vs Vector Search: Where Each Actually Wins
Hybrid retrieval consistently outperforms keyword-only and vector-only retrieval on public benchmarks, but the case for hybrid rests on complementary failure sets, not on average scores.
On Natural Questions, 21.1% of questions are answerable by exactly one of BM25 or DPR, not both. On TREC-COVID, the best dense model evaluated in BEIR trailed BM25 by 17.3 nDCG@10 points. Those two numbers are the case for hybrid retrieval in a nutshell: the failure modes of keyword and dense search are structurally complementary, and the gap between them is large enough that picking one leaves material recall on the table. For the broader arc this sits inside, see the table of contents.
The complementarity is not a hand-wave
A Venn diagram analysis of Recall@10 on Natural Questions showed 32.4% of questions answerable by both BM25 and DPR, 15.2% answerable only by DPR, 5.9% only by BM25, and 46.5% by neither (Lee et al., 2023). The 21.1% of queries answerable by exactly one system is the pool no single retriever can fully cover: a BM25-only stack gives up the 15.2%, a DPR-only stack gives up the 5.9%, and only a system that consults both can reach both pools.
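The partition above is cheap to reproduce on your own corpus once you have relevance labels and each retriever's top-k lists. A minimal sketch (function and variable names here are illustrative, not from the paper's released code):

```python
def venn_partition(gold, bm25_topk, dense_topk):
    """Bucket each query by which retriever retrieves a gold document.

    gold:       dict mapping query id -> set of relevant doc ids
    bm25_topk:  dict mapping query id -> list of top-k doc ids from BM25
    dense_topk: dict mapping query id -> list of top-k doc ids from the
                dense retriever
    Returns counts for the four Venn regions at Recall@k.
    """
    counts = {"both": 0, "dense_only": 0, "bm25_only": 0, "neither": 0}
    for q, answers in gold.items():
        hit_bm25 = bool(answers & set(bm25_topk[q]))
        hit_dense = bool(answers & set(dense_topk[q]))
        if hit_bm25 and hit_dense:
            counts["both"] += 1
        elif hit_dense:
            counts["dense_only"] += 1
        elif hit_bm25:
            counts["bm25_only"] += 1
        else:
            counts["neither"] += 1
    return counts
```

The "exactly one" share from this partition (`dense_only + bm25_only` over the total) is the number to watch: it is the recall a single-retriever stack structurally cannot reach.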
The pattern repeats across the 18 datasets of BEIR. BM25 wins on entity retrieval, argument retrieval, biomedical IR, and some news corpora, while dense retrievers win on duplicate question detection and natural-language QA. The best dense model in that evaluation, TAS-B, trailed BM25 by 17.3 nDCG@10 points on TREC-COVID and 7.8 on Touché-2020, but beat BM25 on tasks like Quora, Natural Questions, and MS MARCO, with the largest gap at 18 nDCG@10 points on MS MARCO (Thakur et al., 2021). The gap runs in both directions, and it is large.
Fusion: what actually combines the signals
Three fusion approaches dominate production systems: Reciprocal Rank Fusion, which sums reciprocal ranks across retrievers and needs no score normalization (Cormack et al., 2009); linear interpolation, which takes a weighted sum of normalized scores; and learned fusion, which trains a ranking model on click or judgment data. Each buys more expressive power at the cost of more operational machinery, and the right choice depends on how much labeled data and evaluation infrastructure a team already has.
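RRF, the first rung, fits in a few lines. A sketch of the standard formulation, where each document scores the sum of 1/(k + rank) across the retrievers that returned it:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion (Cormack et al., 2009).

    rankings: list of ranked doc-id lists, one per retriever.
    k:        smoothing constant; 60 is the value from the original
              paper and the default in several search platforms.
    Score(d) = sum over retrievers of 1 / (k + rank(d)), rank from 1.
    Documents absent from a retriever's list contribute nothing
    from that retriever, so no score normalization is needed.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only consumes ranks, it is indifferent to whether one retriever emits BM25 scores in the tens and the other cosine similarities in [0, 1], which is exactly why it needs no normalization stage.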
Most major search platforms default to RRF. Azure AI Search uses RRF with a default k=60 as its core hybrid ranking algorithm. Elasticsearch added RRF as its hybrid fusion method in version 8.8 (2023). OpenSearch added RRF support in version 2.19 (2025). Weaviate and MongoDB Atlas ship rank-fusion operators as well. The presence of RRF everywhere does not mean it is the right answer for every workload; it means it is the lowest-friction starting point.
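The second rung, linear interpolation, does need score normalization before the weighted sum. A sketch assuming min-max normalization per result list; the weight `alpha` and the 0.5 starting point are illustrative knobs, not a recommended setting:

```python
def minmax_normalize(scores):
    """Map raw retriever scores to [0, 1]; constant lists map to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def linear_fuse(lexical, dense, alpha=0.5):
    """Weighted sum of normalized scores: alpha*lexical + (1-alpha)*dense.

    lexical, dense: dicts mapping doc id -> raw score from each retriever.
    Documents missing from one retriever contribute 0 from that side.
    alpha is the tuning knob teams fit on clicks or judgments.
    """
    lex_n = minmax_normalize(lexical) if lexical else {}
    den_n = minmax_normalize(dense) if dense else {}
    docs = set(lex_n) | set(den_n)
    fused = {d: alpha * lex_n.get(d, 0.0) + (1 - alpha) * den_n.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)
```

The normalization choice is not incidental: min-max, z-score, and rank-based normalization can reorder results, and tuning `alpha` without relevance judgments is one of the operational costs that makes this rung more expensive than RRF.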
When hybrid is not worth the complexity
Hybrid retrieval is the correct default for most production search systems, but not for every one. Pure code search, where both queries and documents live inside a tightly constrained vocabulary, often does well with a single specialized retriever. Tiny catalogs with curated synonyms may not justify the operational cost of a second index. Read-mostly archives with well-behaved queries can get a long way on BM25 plus good analyzer work. Systems that deploy off-the-shelf embedding models on highly specialized domains without fine-tuning sometimes find that the semantic retrieval path introduces more noise than signal.
The decision is not "does hybrid help on my benchmark?" The published evidence on that question is one-sided: hybrid beats both components on nearly every benchmark where a reasonable embedding model is available. The real decision is whether the expected quality gain justifies a second index, a fusion stage, monitoring for both retrievers, and the latency budget to run them in parallel.
What the benchmarks don't tell you
Hybrid wins on paper. Choosing between RRF, linear interpolation, and learned fusion, and knowing when hybrid is genuinely not worth the complexity for a specific corpus and query distribution, requires a decision framework the benchmarks alone do not provide. Which fusion method to start with, how to tune it without relevance judgments, when to graduate to the next rung, and how to tell whether a poorly matched embedding model is dragging results down rather than lifting them: those are engineering calls, not retrieval research findings. Chapter 3 works through the framework end to end.
Related chapter
Chapter 3: The Case for Hybrid
Where each retrieval paradigm breaks down is not arbitrary: lexical and dense methods miss different queries in predictable, measurable ways, and the overlap between their failure sets is small. This chapter develops that observation, walks through the three main fusion approaches (Reciprocal Rank Fusion, weighted linear interpolation, and learned fusion), surveys the benchmark evidence that hybrid consistently beats either approach alone, and offers a framework for deciding when the added complexity earns its keep.
Laszlo Csontos
Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.