source: arxiv artificial intelligence: life after benchmark saturation: a case study of core-bench

level: research

when accuracy on a benchmark saturates, the benchmark is usually retired and replaced with a harder one. this practice treats accuracy as the only signal and misses other important aspects of agent performance. the study uses core-bench hard, a benchmark for computational reproducibility of scientific code, to examine six dimensions beyond accuracy: construct validity (for example, shortcuts), out-of-distribution generalization, efficiency, reliability, the role of the model versus the scaffold, and improvement from human-agent collaboration.

the researchers found threats to construct validity in core-bench hard that were hard to see with less capable agents. they created an improved benchmark, core-bench v1.1, and an out-of-distribution task suite, core-bench ood. by measuring agents on these new tasks, they uncovered meaningful performance differences that accuracy alone did not show. for example, some agents took shortcuts that worked on the original benchmark but failed on out-of-distribution tasks.

the study also examined efficiency and reliability, finding that agents varied widely in how many attempts and how much compute they needed to solve tasks. human-agent collaboration boosted performance, but the gain depended on the agent's baseline ability. these findings suggest that benchmark developers should design evaluations that capture multiple performance dimensions, not just accuracy, to better understand and improve ai agents.
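to make the idea of reporting several dimensions at once concrete, here is a minimal sketch in python. it is not the paper's evaluation code: the `Run` record, the `summarize` function, the field names, and the specific metric definitions (per-run accuracy, solved-in-any-run, solved-in-every-run, cost per successful run) are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    """One attempt by an agent on one task (hypothetical record layout)."""
    task_id: str
    solved: bool
    cost_usd: float  # assumed cost measure; could equally be tokens or seconds


def summarize(runs):
    """Report accuracy alongside simple reliability and efficiency numbers."""
    if not runs:
        raise ValueError("need at least one run")

    # Group repeated attempts by task so reliability can be measured per task.
    by_task = defaultdict(list)
    for run in runs:
        by_task[run.task_id].append(run)

    n_tasks = len(by_task)
    n_success = sum(run.solved for run in runs)
    total_cost = sum(run.cost_usd for run in runs)

    return {
        # Fraction of individual runs that succeeded.
        "accuracy": n_success / len(runs),
        # Fraction of tasks solved in at least one attempt.
        "solved_any": sum(any(r.solved for r in rs) for rs in by_task.values()) / n_tasks,
        # Fraction of tasks solved in every attempt (a strict reliability measure).
        "solved_all": sum(all(r.solved for r in rs) for rs in by_task.values()) / n_tasks,
        # Total spend divided by successful runs; infinite if nothing succeeded.
        "cost_per_success": total_cost / n_success if n_success else float("inf"),
    }
```

under these assumed definitions, two agents with the same accuracy can still differ sharply on `solved_all` or `cost_per_success`, which is the kind of gap a single accuracy number hides.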

why it matters: for ai and data science, this work shows that relying on accuracy alone can hide critical weaknesses in agent systems, such as shortcut learning or poor generalization, which matter in real-world deployment.

