Hybrid Search vs Vector Search: Where Each Actually Wins
Hybrid retrieval consistently outperforms keyword-only and vector-only retrieval on public benchmarks, but the case for hybrid rests on complementary failure sets, not on average scores.
On Natural Questions, 21.1% of questions are answerable by exactly one of BM25 or DPR, not both. On TREC-COVID, the best dense model evaluated in BEIR trailed BM25 by 17.3 nDCG@10 points. Those two numbers are the case for hybrid retrieval in a nutshell: the failure modes of keyword and dense search are structurally complementary, and the gap between them is large enough that picking one leaves material recall on the table. For the broader arc this sits inside, see the table of contents.
The complementarity is not a hand-wave
A Venn diagram analysis of Recall@10 on Natural Questions showed 32.4% of questions answerable by both BM25 and DPR, 15.2% answerable only by DPR, 5.9% only by BM25, and 46.5% by neither (Lee et al., 2023). The 21.1% of queries answerable by exactly one system is the pool no single retriever can fully cover: a BM25-only stack gives up the 15.2%, a DPR-only stack gives up the 5.9%, and only a system that consults both can reach both pools.
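The partition above is cheap to reproduce on your own corpus once you have relevance labels and each retriever's top-k lists. A minimal sketch (function and variable names here are illustrative, not from the paper's released code):

```python
def venn_partition(gold, bm25_topk, dense_topk):
    """Bucket each query by which retriever retrieves a gold document.

    gold:       dict mapping query id -> set of relevant doc ids
    bm25_topk:  dict mapping query id -> list of top-k doc ids from BM25
    dense_topk: dict mapping query id -> list of top-k doc ids from the
                dense retriever
    Returns counts for the four Venn regions at Recall@k.
    """
    counts = {"both": 0, "dense_only": 0, "bm25_only": 0, "neither": 0}
    for q, answers in gold.items():
        hit_bm25 = bool(answers & set(bm25_topk[q]))
        hit_dense = bool(answers & set(dense_topk[q]))
        if hit_bm25 and hit_dense:
            counts["both"] += 1
        elif hit_dense:
            counts["dense_only"] += 1
        elif hit_bm25:
            counts["bm25_only"] += 1
        else:
            counts["neither"] += 1
    return counts
```

The "exactly one" share from this partition (`dense_only + bm25_only` over the total) is the number to watch: it is the recall a single-retriever stack structurally cannot reach.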
The pattern repeats across the 18 datasets of BEIR. BM25 wins on entity retrieval, argument retrieval, biomedical IR, and some news corpora, while dense retrievers win on duplicate question detection and natural-language QA. The best dense model in that evaluation, TAS-B, trailed BM25 by 17.3 nDCG@10 points on TREC-COVID and 7.8 on Touché-2020, but beat BM25 on tasks like Quora, Natural Questions, and MS MARCO, with the largest gap at 18 nDCG@10 points on MS MARCO (Thakur et al., 2021). The gap runs in both directions, and it is large.
Fusion: what actually combines the signals
Three fusion approaches dominate production systems: Reciprocal Rank Fusion, which sums reciprocal ranks across retrievers and needs no score normalization (Cormack et al., 2009); linear interpolation, which takes a weighted sum of normalized scores; and learned fusion, which trains a ranking model on click or judgment data. Each buys more expressive power at the cost of more operational machinery, and the right choice depends on how much labeled data and evaluation infrastructure a team already has.
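RRF, the first rung, fits in a few lines. A sketch of the standard formulation, where each document scores the sum of 1/(k + rank) across the retrievers that returned it:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion (Cormack et al., 2009).

    rankings: list of ranked doc-id lists, one per retriever.
    k:        smoothing constant; 60 is the value from the original
              paper and the default in several search platforms.
    Score(d) = sum over retrievers of 1 / (k + rank(d)), rank from 1.
    Documents absent from a retriever's list contribute nothing
    from that retriever, so no score normalization is needed.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only consumes ranks, it is indifferent to whether one retriever emits BM25 scores in the tens and the other cosine similarities in [0, 1], which is exactly why it needs no normalization stage.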
Most major search platforms default to RRF. Azure AI Search uses RRF with a default k=60 as its core hybrid ranking algorithm. Elasticsearch added RRF as its hybrid fusion method in version 8.8 (2023). OpenSearch added RRF support in version 2.19 (2025). Weaviate and MongoDB Atlas ship rank-fusion operators as well. The presence of RRF everywhere does not mean it is the right answer for every workload; it means it is the lowest-friction starting point.
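The second rung, linear interpolation, does need score normalization before the weighted sum. A sketch assuming min-max normalization per result list; the weight `alpha` and the 0.5 starting point are illustrative knobs, not a recommended setting:

```python
def minmax_normalize(scores):
    """Map raw retriever scores to [0, 1]; constant lists map to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def linear_fuse(lexical, dense, alpha=0.5):
    """Weighted sum of normalized scores: alpha*lexical + (1-alpha)*dense.

    lexical, dense: dicts mapping doc id -> raw score from each retriever.
    Documents missing from one retriever contribute 0 from that side.
    alpha is the tuning knob teams fit on clicks or judgments.
    """
    lex_n = minmax_normalize(lexical) if lexical else {}
    den_n = minmax_normalize(dense) if dense else {}
    docs = set(lex_n) | set(den_n)
    fused = {d: alpha * lex_n.get(d, 0.0) + (1 - alpha) * den_n.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)
```

The normalization choice is not incidental: min-max, z-score, and rank-based normalization can reorder results, and tuning `alpha` without relevance judgments is one of the operational costs that makes this rung more expensive than RRF.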
When hybrid is not worth the complexity
Hybrid retrieval is the correct default for most production search systems, but not for every one. Pure code search, where both queries and documents live inside a tightly constrained vocabulary, often does well with a single specialized retriever. Tiny catalogs with curated synonyms may not justify the operational cost of a second index. Read-mostly archives with well-behaved queries can get a long way on BM25 plus good analyzer work. Systems that deploy off-the-shelf embedding models on highly specialized domains without fine-tuning sometimes find that the semantic retrieval path introduces more noise than signal.
The decision is not "does hybrid help on my benchmark?" The published evidence on that question is one-sided: hybrid beats both components on nearly every benchmark where a reasonable embedding model is available. The real decision is whether the expected quality gain justifies a second index, a fusion stage, monitoring for both retrievers, and the latency budget to run them in parallel.
What the benchmarks don't tell you
Hybrid wins on paper. Choosing between RRF, linear interpolation, and learned fusion, and knowing when hybrid is genuinely not worth the complexity for a specific corpus and query distribution, requires a decision framework the benchmarks alone do not provide. Which fusion method to start with, how to tune it without relevance judgments, when to graduate to the next rung, and how to tell whether a poorly matched embedding model is dragging results down rather than lifting them: those are engineering calls, not retrieval research findings. Chapter 3 works through the framework end to end.
Related chapter
Chapter 3: The Case for Hybrid
Where each retrieval paradigm breaks down is not arbitrary: lexical and dense methods miss different queries in predictable, measurable ways, and the overlap between their failure sets is small. This chapter develops that observation, walks through the three main fusion approaches (Reciprocal Rank Fusion, weighted linear interpolation, and learned fusion), surveys the benchmark evidence that hybrid consistently beats either approach alone, and offers a framework for deciding when the added complexity earns its keep.
Laszlo Csontos
Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.