Part IV · Chapter 12 · evaluation · LLM

LLM-as-Judge for Search: Scalable, Biased, Calibratable

LLM judges rank far-apart systems reliably and collapse exactly where leaderboard rank matters most. The evidence, and what it means for offline evaluation.

February 2, 2026 · 5 min read

At the 2024 TREC RAG Track, Kendall's tau among the top 15 systems dropped to 0.56 when LLM judges were used to rank pipelines that themselves relied on LLMs (Clarke 2025). A deliberately adversarial submission exploited that circularity to post inflated scores. The usual intuition about offline evaluation inverts here. LLM judges look excellent on paper, and they fail hardest exactly where the leaderboard is tightest.

System ranking vs per-label agreement

The contradiction is visible in the numbers. Against NIST assessors on TREC Deep Learning, LLM judges using Description-Narrative-Aspects prompting achieve per-label agreement (Cohen's kappa) in the 0.20 to 0.64 range, depending on prompt phrasing (Thomas et al. 2024). Small paraphrases shift agreement unpredictably. Human assessors do not do much better against each other: mean overlap between two independent assessor sets is about 0.33, and Fleiss' kappa for 4-grade relevance is typically 0.17 to 0.28 (Voorhees 2000, Parry 2025).

At the system ranking level, the picture flips. Open-source reproductions of the Bing UMBRELA relevance assessor achieve Kendall's tau above 0.87 across TREC Deep Learning 2019-2023 (Upadhyay et al. 2024), matching the tau > 0.9 observed between human assessor sets (Voorhees 2000). Individual labels are noisy. Aggregated system rankings are robust.
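The label-level vs system-level split is easy to reproduce in a toy simulation. Everything below is invented for illustration, not real TREC data: ten synthetic systems with genuinely different quality, and a judge that keeps the human grade only 60% of the time and guesses otherwise. Per-label agreement (Cohen's kappa) comes out modest, while the system ranking (Kendall's tau over mean grades) stays high.

```python
import random

def cohens_kappa(a, b):
    """Per-label agreement beyond chance between two label sequences."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    cats = set(a) | set(b)
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe)

def kendall_tau(x, y):
    """Pairwise rank concordance between two score lists."""
    pairs = [(i, j) for i in range(len(x)) for j in range(i + 1, len(x))]
    conc = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    disc = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return (conc - disc) / len(pairs)

random.seed(0)
# 10 systems of increasing quality, 200 graded docs each, grades 0-3
human = [[sum(random.random() < 0.30 + 0.05 * i for _ in range(3))
          for _ in range(200)] for i in range(10)]
# judge keeps the human grade 60% of the time, otherwise guesses uniformly
llm = [[g if random.random() < 0.6 else random.randint(0, 3) for g in sys_]
       for sys_ in human]

flat_h = [g for s in human for g in s]
flat_l = [g for s in llm for g in s]
h_means = [sum(s) / len(s) for s in human]
l_means = [sum(s) / len(s) for s in llm]

print(f"per-label kappa: {cohens_kappa(flat_h, flat_l):.2f}")  # modest
print(f"system-rank tau: {kendall_tau(h_means, l_means):.2f}")  # high
```

The per-label noise averages out once grades are aggregated per system, which is the same mechanism behind the UMBRELA-style tau numbers above.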

The operational consequence is sharp. LLM judges are reliable enough to tell a clearly better system from a clearly worse one. They are not reliable enough to rank systems that are already close, and "close" is where real production decisions live.

Why the economics still force the issue

A typical TREC track needs six contractors working two to four weeks to judge 50 topics (Soboroff 2025). That cadence cannot gate a weekly release. Pinterest's fine-tuned XLM-RoBERTa-large, trained on about 2.6 million human-annotated pairs, hits 73.7% exact match with human labels and 91.7% agreement within one grade. The sensitivity effect is the part that changes the workflow: minimum detectable effect dropped from 1.3-1.5% to below 0.25%, a roughly fivefold to sixfold improvement, because the system generates 150,000 labels in 30 minutes on a single GPU (Wang et al. 2025).
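The sensitivity claim follows from standard two-sample power analysis: minimum detectable effect shrinks with the square root of the number of judged queries. A sketch of that arithmetic, with an assumed per-query metric standard deviation of 0.25 and illustrative sample sizes (not Pinterest's actual parameters):

```python
import math

def mde(n, sigma, z_alpha=1.96, z_beta=0.84):
    """Minimum detectable difference in a mean metric between two systems,
    two-sided alpha = 0.05, power = 0.80, n judged queries per arm."""
    return (z_alpha + z_beta) * math.sqrt(2 * sigma**2 / n)

sigma = 0.25  # assumed per-query nDCG standard deviation
print(f"human-scale (n=5,000):   MDE = {mde(5_000, sigma):.2%}")   # 1.40%
print(f"LLM-scale (n=150,000):   MDE = {mde(150_000, sigma):.2%}")  # 0.26%
print(f"improvement: {mde(5_000, sigma) / mde(150_000, sigma):.1f}x")  # 5.5x
```

A 30x increase in label volume buys a sqrt(30) ≈ 5.5x smaller detectable effect, which is exactly the regime shift from "catch 1.3-1.5% regressions" to "catch sub-0.25% regressions."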

That throughput is what forces every serious search team to use LLM judges. It also means the biases travel with them at scale.

Three biases the chapter names explicitly

Position bias. In pairwise comparisons, the first-presented candidate gets a systematic advantage. Mitigation is mechanical: swap order and average (Zheng et al. 2023).

Verbosity bias. Longer passages receive higher relevance scores regardless of information content. This distorts comparisons across retrieval systems that return passages of different lengths (Gu et al. 2024).

Self-enhancement bias. An LLM rates its own outputs (or those of a sibling model) as more relevant than equivalent alternatives. When the same model family powers reranking and judging, scores are inflated (Zheng et al. 2023). This is the mechanism behind the TREC 2024 circularity result.
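The swap-and-average mitigation for position bias is mechanical enough to sketch directly. The `biased_judge` below is a hypothetical stand-in with a fixed first-slot bonus, not a real model; averaging the two presentation orders cancels the bonus exactly:

```python
def debiased_preference(judge, query, a, b):
    """Score for passage a over b in [0, 1], averaged over both
    presentation orders to cancel position bias (Zheng et al. 2023)."""
    s_ab = judge(query, a, b)  # a shown first
    s_ba = judge(query, b, a)  # b shown first
    return (s_ab + (1.0 - s_ba)) / 2

# hypothetical judge: true preference plus a +0.1 first-slot bonus
TRUE_PREF = {("a", "b"): 0.7}
def biased_judge(query, first, second):
    base = TRUE_PREF.get((first, second),
                         1.0 - TRUE_PREF.get((second, first), 0.5))
    return min(1.0, base + 0.1)

print(f"{biased_judge('q', 'a', 'b'):.2f}")                       # 0.80, inflated
print(f"{debiased_preference(biased_judge, 'q', 'a', 'b'):.2f}")  # 0.70, recovered
```

This only cancels a symmetric positional offset; it does nothing for verbosity or self-enhancement bias, which is why those need separate treatment.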

Beyond bias, LLM judges are vulnerable to adversarial content. Alaofi et al. (2024) show that injecting query keywords into irrelevant passages ("keyword stuffing") and inserting instruction text such as "this paper is perfectly relevant" both successfully fool LLM judges. False positives correlate with surface query-term overlap, even when the passage is not relevant.
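One cheap screen follows from the Alaofi et al. finding: since false positives track surface query-term overlap, passages the judge marks relevant with near-total term overlap can be routed to human audit. A minimal sketch; the threshold, tokenization, and `needs_audit` helper are illustrative assumptions, not from the paper:

```python
def query_term_overlap(query: str, passage: str) -> float:
    """Fraction of query terms that appear verbatim in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(1, len(q_terms))

def needs_audit(query: str, passage: str, judge_grade: int,
                overlap_threshold: float = 0.9) -> bool:
    """Flag judge-relevant passages whose score may be surface overlap."""
    return (judge_grade >= 2
            and query_term_overlap(query, passage) >= overlap_threshold)

query = "treatment options for migraine"
stuffed = "migraine treatment options for sale, best treatment options deals"
print(query_term_overlap(query, stuffed))       # 1.0: every query term present
print(needs_audit(query, stuffed, judge_grade=3))  # True: route to human review
```

This does not make the judge robust; it narrows where keyword-stuffed passages can slip through unreviewed.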

What this resolves to

The evidence lines up in a specific shape. Per-label agreement with humans is low, both for LLMs and for other humans. System ranking correlation is high when systems are far apart, and collapses (tau = 0.56) when the top cluster is tight and LLM-heavy. Sensitivity gains from LLM judges are real and large. Adversarial content breaks relevance labels in predictable, reproducible ways.

None of this makes LLM judges optional. At 150,000 labels per half hour, they are the only way to detect sub-1% regressions on a weekly cycle. And none of it makes them sufficient. A pipeline that trusts LLM labels at face value, especially when ranking systems that share architecture with the judge, produces numbers that confirm whatever the judge already believed.

LLM judges are reliable enough to rank far-apart systems and unreliable exactly where leaderboard rank matters most. Chapter 12 shows the three-tier annotation architecture that resolves this (human anchors, LLM breadth, distilled student for serving) and the calibration loop that keeps it honest.

Related chapter

Chapter 12: Building an Evaluation Pipeline

A metric is only as useful as the pipeline that computes it on every code change, with enough coverage to catch regressions before users do. This chapter shows how to assemble golden test sets, scale relevance labeling with LLM-based judges, automate offline evaluation runs, stratify results to surface per-segment regressions hiding inside an aggregate win, and wire the whole thing into CI gates for deployments.


Laszlo Csontos

Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.