llm benchmarks miss huge capability gaps

source: arxiv machine learning: the evaluation blind spot: a stereological theory of benchmark coverage for large language models

level: research

researchers applied stereological methods (inferring a higher-dimensional structure from lower-dimensional slices) to measure how well llm benchmarks cover model capabilities. they show that any benchmark suite has an effective dimensionality, and that the hidden difference between two models with identical scores can be large. the upper bound on that hidden difference depends on the number of benchmarks and the effective dimensionality, and a matching lower bound shows the bound is tight.
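
to make "effective dimensionality" concrete, here is a minimal sketch that assumes one common definition, the participation ratio of the eigenvalues of the benchmark covariance matrix. the summary does not say which definition the paper uses, so the function name, the formula choice, and the synthetic data are all illustrative assumptions rather than the authors' method.

```python
import numpy as np


def effective_dimensionality(scores: np.ndarray) -> float:
    """Participation ratio (sum(l))^2 / sum(l^2) of the covariance spectrum.

    scores has shape (n_models, n_benchmarks). This definition is an
    assumption for illustration; the paper may define d_eff differently.
    """
    centered = scores - scores.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (scores.shape[0] - 1)
    # eigvalsh is for symmetric matrices; clip tiny negative round-off values.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())


# Hypothetical data: 12 benchmarks driven by only 3 latent capabilities.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 12))
scores = latent @ loadings + 0.05 * rng.normal(size=(200, 12))

d_eff = effective_dimensionality(scores)
print(f"benchmarks: {scores.shape[1]}, effective dimensionality: {d_eff:.2f}")
```

in this synthetic setup the participation ratio comes out well below the benchmark count, because the twelve columns are correlated through three shared factors. that is the same qualitative pattern as a suite with many benchmarks but an effective dimensionality in the low single digits.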

three independent leaderboards were analyzed: open llm v2, an extended 12-benchmark suite, and livebench. all three had an effective dimensionality between 2.86 and 4.80 on their competitive frontier. the structural blind spot exceeded the observed runner-up score gap by two orders of magnitude and dominated statistical noise by a factor of 52 to 127. this suggests that the true capability differences between top models could be far larger than the score gaps the benchmarks show.

simulations using a chi-squared projection model showed that the isotropic prior is the optimistic case. across six hidden-capability priors and four ambient dimensions, the half-split swap rate of the top two models stayed between 0.38 and 0.49. in 500 random-split trials, 92% swapped the top-1 ranking, and on average 2.83 of the top-5 models changed. this indicates that changing which benchmarks are included can substantially reorder the top of the rankings.
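
the half-split idea can be sketched as follows, under one plausible reading: split the benchmarks into two random halves, rank models by mean score on each half, and count how often the halves disagree on the top model. the summary does not give the paper's exact procedure, so the function name, the mean-score ranking rule, and the synthetic scores below are hypothetical.

```python
import numpy as np


def half_split_top1_swap_rate(
    scores: np.ndarray, n_trials: int = 500, seed: int = 0
) -> float:
    """Fraction of random half-splits whose halves pick different top models.

    scores has shape (n_models, n_benchmarks). Ranking by mean score per
    half is an assumption for illustration, not the paper's stated method.
    """
    rng = np.random.default_rng(seed)
    n_benchmarks = scores.shape[1]
    half = n_benchmarks // 2
    swaps = 0
    for _ in range(n_trials):
        perm = rng.permutation(n_benchmarks)
        top_a = np.argmax(scores[:, perm[:half]].mean(axis=1))
        top_b = np.argmax(scores[:, perm[half:]].mean(axis=1))
        swaps += int(top_a != top_b)
    return swaps / n_trials


# Hypothetical leaderboard: 10 closely matched models on 12 benchmarks.
rng = np.random.default_rng(1)
scores = 70.0 + rng.normal(scale=2.0, size=(10, 12))
rate = half_split_top1_swap_rate(scores)
print(f"top-1 swap rate over 500 half-splits: {rate:.2f}")
```

a model that wins every benchmark gives a rate of 0, while closely matched models push the rate up, which is the sensitivity to benchmark selection the simulations describe.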

why it matters: current llm benchmarks are too sparse to reliably rank models, so decisions based on leaderboards may be misleading.
