Part III · Chapter 8 · embeddings · models

Embedding Model Selection: MTEB Rank Is Not Enough

Benchmark leaderboards are a starting point for picking an embedding model, not a decision. Domain fit matters more than MTEB rank, and the commitment is harder to reverse than it looks.

March 2, 2026 · 5 min read

The embedding model is the single most consequential and least reversible decision in a hybrid pipeline, and one of the core topics of the book. Once documents are embedded, switching models means recomputing every vector in the index, a job that scales linearly with corpus size and can take days or weeks for large collections. Many teams default to the top of the MTEB leaderboard. That is usually the wrong move.

Benchmarks measure averages, not your workload

MTEB is the standard leaderboard and aggregates performance across dozens of tasks. The BEIR subset of MTEB covers 18 retrieval datasets spanning entity retrieval, argumentation, biomedical IR, news, and semantic QA. The issue is not that these benchmarks are wrong; it is that aggregate rank on heterogeneous tasks is a weak predictor of performance on your specific corpus and query distribution.

The original BEIR paper made this concrete. DPR, despite strong in-domain performance on Natural Questions, underperformed BM25 on nearly all 18 BEIR datasets once moved out of its training distribution (Thakur et al., 2021). TAS-B underperformed ANCE on TREC-COVID by 17.3 nDCG@10 points despite outperforming the same model on 14 other datasets. Per-dataset variance is extreme for every model tested, which means a model two positions down the leaderboard can beat the top model on the tasks that matter to you.

The strongest evidence for this thesis comes from a direct test. Tang et al. (2024) evaluated seven state-of-the-art embedding models on FinMTEB, a finance-specific benchmark of 64 datasets. The Spearman rank correlation between MTEB and FinMTEB rankings was not statistically significant at p=0.05. Every model dropped substantially on the financial domain, and MTEB rank simply did not predict which model would do best. ANOVA confirmed the domain factor was highly significant across all task types. If you are choosing an encoder for a specialized corpus, the leaderboard is telling you very little.
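The rank-correlation check that Tang et al. ran is easy to reproduce on your own shortlist. A minimal sketch, using illustrative rankings (not the paper's actual numbers) and the no-ties Spearman formula:

```python
def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two rankings with no ties:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical leaderboard positions of seven models on a general
# benchmark vs. a domain benchmark (illustrative, not FinMTEB data).
general_rank = [1, 2, 3, 4, 5, 6, 7]
domain_rank  = [4, 1, 6, 2, 7, 3, 5]

rho = spearman_rho(general_rank, domain_rank)
print(f"Spearman rho = {rho:.2f}")  # → Spearman rho = 0.29
```

A rho this far from 1.0 on a seven-model shortlist is exactly the situation the FinMTEB result describes: general-purpose rank carries little information about domain rank.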

Commercial models require extra skepticism

Commercial embedding APIs publish their own numbers. OpenAI's text-embedding-3-large reports 64.6% on MTEB and 54.9% on MIRACL (OpenAI, 2024). Cohere's Embed v4 and Voyage's voyage-3-large publish competitive numbers on their own evaluation suites. Google's gemini-embedding-001 came to GA in 2025 with similar claims.

Voyage evaluates on a proprietary 100-dataset collection rather than standard MTEB, and none of the major commercial providers publish formal model architectures. Treat commercial benchmarks as directional. Their results are not necessarily inflated, but they are not independently reproduced either.

Dimensionality is a cost lever, not a quality lever

Embedding dimensionality drives storage, query-time cost, and ANN index memory. Higher-dimensional embeddings do not monotonically improve retrieval quality once you pass a threshold that depends on the domain. The post on vector cost optimization goes into Matryoshka embeddings, which allow truncation to smaller dimensions with measured quality loss. OpenAI's text-embedding-3-large retains roughly 93% of retrieval performance when truncated from 3072 to 256 dimensions, and that 256-dimensional vector still outperforms the full 1536-dimensional ada-002. For model selection, the important move is to compare candidates at a dimensionality you can actually afford to deploy, not the maximum the model supports.
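The truncation itself is trivial. A sketch in NumPy, assuming a Matryoshka-trained model whose leading dimensions carry the most information (applying this to a model not trained that way will hurt quality badly):

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize each row to
    unit length, so cosine similarity reduces to a dot product."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Illustrative: four documents embedded at 3072 dims, deployed at 256.
rng = np.random.default_rng(42)
full = rng.standard_normal((4, 3072))
small = truncate_and_normalize(full, 256)

sims = small @ small.T  # pairwise cosine similarities at 256 dims
print(small.shape)      # (4, 256)
```

Running your candidate comparison on the truncated vectors, not the full ones, is what "compare at a dimensionality you can afford" means in practice.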

Evaluate on your data, with discipline

The gap between a leaderboard rank and a production decision is closed by evaluation on data that represents your actual workload. In classical IR evaluation, 50 to 100 queries is the floor for statistically meaningful comparisons, and more is always better. Below that floor, score differences between candidate models are noise.

A disciplined evaluation process is more than picking a metric. It means controlling for variables outside the model itself, choosing a metric that matches the role the embedding model plays in the pipeline, testing for statistical significance rather than eyeballing deltas, and measuring the model's contribution inside the full hybrid pipeline rather than in a vector-only shootout. The specifics of each stage, including how to build the evaluation set, which metric to use when, and which statistical tests hold up at the sample sizes IR practitioners actually collect, are the subject of the full chapter.
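A minimal sketch of what that looks like in code, assuming binary relevance labels and a paired bootstrap on per-query deltas as the significance test (one reasonable choice among several; the function names are illustrative):

```python
import random
from math import log2

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k; graded relevance is a separate
    labeling decision this sketch sidesteps."""
    dcg = sum(1 / log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k])
              if doc in relevant_ids)
    ideal = sum(1 / log2(i + 2)
                for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal else 0.0

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of bootstrap resamples in which model A fails to beat
    model B — a rough one-sided p-value on per-query score deltas."""
    rng = random.Random(seed)
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(deltas)
    losses = sum(
        sum(rng.choice(deltas) for _ in range(n)) <= 0
        for _ in range(n_resamples)
    )
    return losses / n_resamples
```

With per-query nDCG@10 lists for two candidate models, `paired_bootstrap(scores_a, scores_b)` near zero is evidence the gap is real; anything above your chosen threshold means the 50-to-100-query floor has not bought you a decision yet.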

The open question

Benchmark rank does not predict your domain. Vendor numbers are directional. Dimensionality is a cost dial, not a quality dial. So how do you actually choose, with rigor, on a corpus no leaderboard has ever seen? That is what the chapter answers.

Related chapter

Chapter 8: Embedding Model Selection

No other choice in a hybrid pipeline has comparable blast radius: once documents are embedded, switching models means reindexing the entire corpus. The chapter maps the current model landscape, unpacks the benchmarks teams rely on to compare options, explains how embedding dimensionality feeds into storage and query-time cost, and offers a repeatable evaluation recipe for validating candidates on your own data before you commit.



Laszlo Csontos

Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.