Cross-Encoder Reranking: The Highest-Leverage Stage in Hybrid Search
A good reranker can fix a mediocre first-stage retriever, but only if it fits inside a tight latency budget. Pick the candidate set and the model together.
A MiniLM-L6 cross-encoder takes roughly 12 ms to score one document but around 740 ms to score 100 on a GPU (Metarank, 2024). Candidate set size alone decides whether your pipeline lands inside a p99 budget or blows through it. That single measurement reshapes almost every other reranking decision, which is why the book's chapters on hybrid search treat reranking as a stage whose model and depth must be chosen together.
Why cross-encoders dominate on quality
A cross-encoder concatenates the query and candidate document into one sequence and runs full token-level attention between them before producing a relevance score. A bi-encoder instead compresses the document into a single vector before the query is ever seen. Cross-attention lets the model evaluate whether a specific phrase in the document answers a specific part of the query.
The foundational monoBERT paper reported an MRR@10 of approximately 0.365 on MS MARCO passage ranking, against a BM25 baseline of roughly 0.187 (Nogueira and Cho, 2019): close to a doubling of the primary ranking metric from a single reranking pass. The pattern has replicated across BERT, RoBERTa, DeBERTa, and more recent architectures, and across the 18 datasets of the BEIR benchmark, cross-encoder reranking sits at the top of the average-quality ladder (Thakur et al., 2021).
The cost is that the model runs once per candidate rather than once per query. Reranking cost is therefore linear in the candidate count with a large per-candidate constant, since every candidate requires its own transformer forward pass. That scaling with the number of candidates is the whole engineering story of reranking.
The latency budget is the design
A typical production search system operates within a p99 latency budget of 200 to 500 milliseconds. Within that budget, independent measurements for the widely used ms-marco-MiniLM-L6-v2 show roughly 12 ms for a single document, 59 ms for 10 documents, and 740 ms for 100 documents on a GPU (Metarank, 2024). That is why Vespa defaults to 100 candidates for its second-phase reranking and 20 for its most expensive global phase (Vespa docs, 2024), and why the TREC Deep Learning Track standardized on top-100 for its reranking subtask (Craswell et al., 2019).
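Those measurements imply a roughly linear cost curve, which makes the budget arithmetic easy to sketch. The slope and intercept below are illustrative fits to the cited 1- and 100-document points, not numbers to trust for your own hardware (the measured 10-document figure comes in somewhat lower, since batching amortizes overhead):

```python
def rerank_latency_ms(k, per_doc_ms=7.35, overhead_ms=4.65):
    """Linear cost model: one transformer forward pass per candidate.
    Defaults are fitted to the cited MiniLM-L6 GPU measurements
    (~12 ms at k=1, ~740 ms at k=100); treat them as illustrative."""
    return overhead_ms + k * per_doc_ms

def max_candidates(budget_ms, per_doc_ms=7.35, overhead_ms=4.65):
    """Largest candidate set that fits a given reranking latency slice."""
    return max(0, int((budget_ms - overhead_ms) // per_doc_ms))
```

Under this model, a 250 ms reranking slice caps the candidate set at about 33 documents, which is why depth and model have to be chosen together.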
Three architectural choices fall out of this. First, the first-stage retriever must do enough work that the top-k the reranker sees contains most of the relevant documents; a reranker cannot recover documents that never entered the candidate set. Second, the candidate set should be as small as the quality curve allows, typically 20 to 100. Third, the reranker family is chosen against a latency target, not the other way around.
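The first of those constraints is measurable in isolation. A minimal sketch of candidate-set recall, using hypothetical document IDs; whatever this function reports as missing, no downstream reranker can recover:

```python
def candidate_recall(candidates, relevant):
    """Fraction of the relevant documents present in the candidate set.
    A reranker can only reorder what it is given, so this number is a
    hard ceiling on post-reranking quality."""
    relevant = set(relevant)
    if not relevant:
        return 1.0
    return len(relevant & set(candidates)) / len(relevant)
```

Tracking this metric at several depths (k = 20, 50, 100) is how the "as small as the quality curve allows" depth is actually found.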
Three families, one budget
Three model families compete for the reranking slice of that budget, typically 50 to 250 ms, and their trade-offs are not obvious from quality numbers alone.
Cross-encoders like MiniLM and BERT-large variants are the workhorses. They post the highest average quality on BEIR but scale with candidate count through full forward passes.
Late-interaction models like ColBERT and ColBERTv2 precompute document token embeddings and score at query time with a MaxSim operation. On MS MARCO passage reranking, ColBERT matches a BERT-base cross-encoder at MRR@10 of around 0.349 while delivering over 170 times faster reranking and roughly 14,000 times fewer FLOPs per query at k equal to 1000 (Khattab and Zaharia, 2020). ColBERTv2 with residual compression narrows the storage penalty but does not eliminate it (Santhanam et al., 2022).
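MaxSim itself is small enough to sketch directly. A minimal numpy version, assuming per-token embeddings are already computed and L2-normalized so that dot products are cosine similarities; this shows the shape of the operation, not ColBERT's production kernel:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction.
    query_tokens: (num_query_tokens, dim), doc_tokens: (num_doc_tokens, dim),
    both L2-normalized. Each query token picks its best-matching document
    token, and those maxima are summed into one relevance score."""
    sim = query_tokens @ doc_tokens.T   # pairwise token-level similarities
    return float(sim.max(axis=1).sum())
```

The key property is that doc_tokens can be precomputed and stored offline, so query-time scoring is a matrix multiply plus a max, not a transformer forward pass per candidate.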
LLM-based rerankers such as RankZephyr and RankLLaMA reach the best numbers on reasoning-heavy benchmarks. They also run two to three orders of magnitude slower per candidate. Listwise reranking of 100 candidates with RankZephyr 7B on an A100 GPU takes approximately 8.9 seconds per query (Pradeep et al., 2023), which rules out real-time serving for most systems.
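At 8.9 seconds per query, the capacity arithmetic is unforgiving. A back-of-envelope sizing sketch, assuming each query occupies one GPU for the full duration (a simplification; real serving overlaps work across requests):

```python
import math

def gpus_for_qps(target_qps, sec_per_query=8.9):
    """GPUs needed to sustain target_qps when each query holds one GPU
    for sec_per_query seconds. The default is the cited RankZephyr 7B
    figure for listwise reranking of 100 candidates on an A100."""
    return math.ceil(target_qps * sec_per_query)
```

Even a modest 10 queries per second works out to roughly 89 GPUs, which is what "rules out real-time serving for most systems" means in practice.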
The tension is real. A MiniLM-L6 cross-encoder and GPT-4-based RankGPT differ by less than one NDCG@10 point on TREC DL19 while differing by about three orders of magnitude in latency. Which of these families fits your system depends on variables that quality leaderboards do not capture.
Practical implication
The useful ceiling on reranking quality depends on at least three interacting variables: the size of the candidate set, the choice of reranker model, and the strength of the first-stage retriever feeding into it. A stronger first-stage retriever concentrates relevant documents in a smaller top-k and shrinks the reranker's marginal contribution. A weaker one demands deeper reranking and a more expensive model to compensate. Domain fit, out-of-domain generalization, and the shape of the latency envelope each push the right answer in a different direction.
There is no single configuration that survives all of those constraints. The combination that actually works for your system depends on which of those variables is binding, and in what order. Chapter 6 walks through how to pick it.
Related chapter
Chapter 6: The Reranking Stage
Fusion produces a short list that is usually close to correct but rarely in the right order. Reranking takes that short list, rescores it with a heavier model, and fixes the ordering. The chapter surveys cross-encoders, late-interaction architectures like ColBERT, and LLM-based rerankers, and examines why stacking additional rerankers yields diminishing returns once a fixed latency budget is in play.
Laszlo Csontos
Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.
Related Posts
Parallel, sequential, and unified hybrid search architectures are not interchangeable. They make different bets on latency, complexity, and debuggability.
March 30, 2026
Treating every query identically wastes compute on easy queries and under-serves hard ones. Query classification decides which retrieval path to invoke.
March 23, 2026
Every search platform now advertises hybrid support. The implementations behind those APIs differ, and so does the platform that is right for your team.
March 9, 2026