Part IV · Chapter 11 · evaluation · metrics

Search Quality Metrics: Optimizing One Number Is Dangerous

NDCG, MRR, and recall do not measure the same thing. Picking a single metric to optimize guarantees you will ship something that regresses on the others.

February 9, 2026 · 5 min read

Buckley and Voorhees (2000) reported that Precision@30 has roughly twice the error rate of Average Precision when comparing systems, meaning an experiment that looks conclusive under one metric can be inconclusive under another on the same data. This is the uncomfortable fact at the heart of search evaluation, and the reason Chapter 11 treats metric selection as a design problem rather than a default. Teams that publish one number and call it "search quality" are not wrong about the number. They are wrong about the coverage.

The metric families do not agree

Five metric families dominate search evaluation. Precision and recall, whose modern formulation was refined by Van Rijsbergen (1979) and extended through decades of subsequent work, describe the two sides of any retrieval result set: what fraction of shown results are relevant, and what fraction of relevant results were shown. NDCG rewards putting the most relevant results at the top and supports graded relevance. MRR focuses on the single first relevant result, which matches question-answering and navigational interfaces. MAP integrates precision across every rank at which a relevant document appears. Recall@k measures candidate-set completeness and ignores position inside that set by design.
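The five families above can be made concrete in a few lines each. The following is a minimal sketch over a single query with binary labels (graded labels for NDCG); the document IDs and relevance grades are invented for illustration, and production evaluation would average these over a full query set.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant result (0 if none appears)."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant doc."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

def ndcg_at_k(ranked, gains, k):
    """NDCG with graded relevance: gains maps doc -> grade (0 = irrelevant)."""
    def dcg(docs):
        return sum(gains.get(d, 0) / math.log2(i + 1)
                   for i, d in enumerate(docs, start=1))
    ideal = sorted(gains, key=gains.get, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(ranked[:k]) / ideal_dcg if ideal_dcg else 0.0

# Illustrative single-query judgments (not from the article).
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}          # d9 was never retrieved
gains = {"d1": 3, "d2": 1, "d3": 2}    # graded labels for NDCG

print(precision_at_k(ranked, relevant, 5))   # 0.4  (2 of 5 shown are relevant)
print(recall_at_k(ranked, relevant, 5))      # 0.667 (2 of 3 relevant were shown)
print(mrr(ranked, relevant))                 # 0.5  (first hit at rank 2)
print(average_precision(ranked, relevant))   # 0.333
print(ndcg_at_k(ranked, gains, 5))
```

Note how the same ranked list produces five different numbers, each answering a different question about it.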

These metrics disagree, and the disagreements are neither small nor random. When TREC judgments were reassessed on a graded scale, Sormunen (2002) found that approximately 50% of documents judged relevant under binary assessment were only marginally relevant, and only around 16% were highly relevant. Kekäläinen (2005) showed that the correlation between system rankings under binary evaluation and under graded evaluation with strong weighting of highly relevant documents is lower than the correlation between rankings produced by different human assessors applying binary labels. The choice of metric shifts "which system is best" more than the humans doing the judging do.

The cutoff matters as much as the metric

Every ranked-retrieval metric has a cutoff, and cutoffs change the answer. Precision@1 is not Precision@10. Recall@5 and Recall@100 measure different systems. A reranker tested at k=10 and a first-stage retriever tested at k=1000 are optimizing different things. A single number without a cutoff is not comparable to any other single number without a cutoff.

The cutoff should match the consumption pattern. A question-answering interface that displays one answer aligns with Recall@1 or MRR. A product search result page that shows the first page of 24 items aligns with NDCG@10 or Precision@10. A RAG pipeline that feeds 5 passages into the LLM context cares first about Recall@5, because a passage that is not retrieved cannot be generated from.
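The cutoff can flip which system wins outright. A hypothetical single-query example (document IDs invented): system A nails the first result and nothing else, system B misses rank 1 but fills the rest of the page with relevant documents.

```python
def precision_at_k(ranked, relevant, k):
    return sum(d in relevant for d in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    return sum(d in relevant for d in ranked[:k]) / len(relevant)

relevant = {"r1", "r2", "r3", "r4", "r5"}
system_a = ["r1", "x1", "x2", "x3", "x4"]  # one relevant doc, at rank 1
system_b = ["x1", "r1", "r2", "r3", "r4"]  # four relevant docs, ranks 2-5

print(precision_at_k(system_a, relevant, 1))  # 1.0 -> A wins at k=1
print(precision_at_k(system_b, relevant, 1))  # 0.0
print(recall_at_k(system_a, relevant, 5))     # 0.2
print(recall_at_k(system_b, relevant, 5))     # 0.8 -> B wins at k=5
```

Under a one-answer interface A is the better system; feeding five passages into a RAG context, B is. Neither number is wrong; they answer different product questions.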

Segments are where the real regressions hide

Aggregate metrics hide per-segment regressions. A model change that improves overall quality can still move specific query segments in the opposite direction, and the aggregate number will obscure it. Entity-heavy queries, long-tail queries, and queries tied to recent content are classic examples of segments where a "better" model quietly gets worse. The only way to see those movements is to compute metrics on the segments directly rather than on the global pool.
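The mechanics of a segment check are simple: compute the per-query metric delta between the candidate model and the baseline, then aggregate within each segment instead of only globally. A sketch with invented per-query MRR deltas and segment tags:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (new model - baseline) MRR deltas, tagged by query segment.
deltas = [
    ("head",      +0.10), ("head",      +0.12), ("head",      +0.08),
    ("long_tail", +0.05), ("long_tail", +0.04),
    ("entity",    -0.06), ("entity",    -0.09),
]

by_segment = defaultdict(list)
for segment, delta in deltas:
    by_segment[segment].append(delta)

overall = mean(d for _, d in deltas)
print(f"overall: {overall:+.3f}")              # positive: aggregate says "ship"
for segment, values in sorted(by_segment.items()):
    print(f"{segment}: {mean(values):+.3f}")   # entity segment regresses
```

The aggregate delta here is positive while the entity segment moves in the opposite direction, which is exactly the pattern a single global number cannot surface.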

Why a single number is dangerous

The precision-recall trade-off is the most familiar tension, but it is not the only one. Metrics built on different user models can disagree on the same rankings. Metrics differ in statistical power, so one metric may show a significant difference between two systems while another does not. And Goodhart's Law applies: a metric optimized aggressively tends to decouple from the underlying quality it was supposed to proxy for.
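The user-model disagreement is easy to demonstrate on a single query. In this invented example, MRR under binary labels prefers the ranking that puts a marginally relevant document first, while NDCG under graded labels prefers the ranking that surfaces the highly relevant document, even though it sits at rank 2.

```python
import math

def mrr(ranked, relevant):
    """Binary user model: only the first relevant hit matters."""
    return next((1.0 / i for i, d in enumerate(ranked, 1) if d in relevant), 0.0)

def ndcg_at_k(ranked, gains, k):
    """Graded user model: highly relevant documents dominate the score."""
    def dcg(docs):
        return sum(gains.get(d, 0) / math.log2(i + 1)
                   for i, d in enumerate(docs, 1))
    ideal = dcg(sorted(gains, key=gains.get, reverse=True)[:k])
    return dcg(ranked[:k]) / ideal if ideal else 0.0

gains = {"hi": 3, "lo": 1}      # graded labels (illustrative)
relevant = set(gains)           # binary collapse: both docs count as relevant
ranking_a = ["lo", "x1", "x2"]  # marginally relevant doc first
ranking_b = ["x1", "hi", "x2"]  # highly relevant doc second

print(mrr(ranking_a, relevant), mrr(ranking_b, relevant))  # 1.0 0.5 -> A wins
print(ndcg_at_k(ranking_a, gains, 3),
      ndcg_at_k(ranking_b, gains, 3))                      # B wins
```

Two defensible metrics, the same two rankings, opposite verdicts; this is the Sormunen finding about marginal relevance playing out at the metric level.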

The question the article does not answer

Which metric should be primary, which should be secondary, and which should be guardrails? How should a team decide when the primary metric says "ship" and a secondary metric says "regression"? The honest answer is that the decision depends on the product interface, the pipeline stage, and the business cost of different failure modes, and no universal recipe works across those.

Chapter 11 is where the book commits to a framework this article deliberately withheld. It lays out the decision process for matching a metric to a product interface and to a specific pipeline stage, including the separation between candidate generation and reranking. It works through the bounded-recall problem, which explains why a relevant document missing from the first-stage candidate pool cannot be recovered by any downstream component, and why that single fact rearranges the metric hierarchy for RAG and multi-stage ranking. And it defines the three-tier suite (primary, secondary, guardrail) that lets a team ship decisions when the metrics disagree without collapsing back to a single number.

The unresolved question for any team reading this is simple to ask and hard to answer without that framework: given two offline runs that disagree, which metric has the authority to decide?

Related chapter

Chapter 11: Search Quality Metrics

Few product surfaces admit as much rigorous measurement as ranked retrieval does, and decades of IR research since the Cranfield and TREC eras have produced a stable set of evaluation tools. This chapter works through the core metric families (precision, recall, NDCG, MRR, and their common variants) and lays out a selection framework for matching each metric to the product question it actually answers.



Laszlo Csontos

Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.