source: hugging face blog: the open agent leaderboard

level: technical

most ai evaluations report a model's score on a single benchmark. deploying an agent, though, means choosing a full system: tools, planning, memory, and error recovery. changing any one part can alter both results and costs. the open agent leaderboard, launched by ibm research, compares entire agent systems across six established benchmarks: swe-bench verified, browsecomp+, appworld, and the airline, retail, and telecom domains of tau2-bench. it reports success rate and cost per task for each system, showing both what works and what is worth deploying.
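as a rough illustration of the two reported metrics, the sketch below aggregates per-run records into a success rate and a mean cost per task for each agent and benchmark pair. the `RunRecord` shape, the agent and benchmark names, and the dollar figures are all made up for the example; they are not the leaderboard's data format or results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One agent run on one task (hypothetical record shape)."""
    agent: str
    benchmark: str
    success: bool
    cost_usd: float


def summarize(records):
    """Return {(agent, benchmark): (success_rate, mean_cost_per_task)}."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.agent, r.benchmark)].append(r)

    summary = {}
    for key, runs in groups.items():
        success_rate = sum(r.success for r in runs) / len(runs)
        mean_cost = sum(r.cost_usd for r in runs) / len(runs)
        summary[key] = (success_rate, mean_cost)
    return summary


# made-up records, for illustration only
records = [
    RunRecord("agent-a", "bench-x", True, 0.40),
    RunRecord("agent-a", "bench-x", False, 0.60),
    RunRecord("agent-b", "bench-x", True, 0.90),
    RunRecord("agent-b", "bench-x", True, 1.10),
]

for (agent, bench), (rate, cost) in sorted(summarize(records).items()):
    print(f"{agent} on {bench}: success={rate:.0%}, cost/task=${cost:.2f}")
```

reporting both numbers side by side is what lets a cheaper, slightly weaker system be compared fairly against a stronger, more expensive one.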

the leaderboard uses a unified protocol that gives every benchmark the same shape: a task, its context, and a set of allowed actions. this lets agents keep their native tools while connecting to all benchmarks. early results show that general-purpose agents without task-specific tuning can match or beat specialized systems. model choice is the main factor in performance, but agent architecture matters too. tool shortlisting, which narrows the agent's focus to the relevant tools, improved performance across all tested models and turned failing setups into viable ones.
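the common task shape and the idea of tool shortlisting can be sketched in a few lines. this is a minimal sketch under stated assumptions, not the leaderboard's protocol or the exgentic api: the `Task` and `Tool` classes, the `shortlist_tools` function, and the word-overlap relevance score are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A callable action exposed to the agent (hypothetical shape)."""
    name: str
    description: str


@dataclass(frozen=True)
class Task:
    """Common shape: a task, its context, and the allowed actions."""
    instruction: str
    context: str
    allowed_actions: tuple


def shortlist_tools(task, k=2):
    """Keep the k tools whose descriptions share the most words with the task.

    Word overlap is an assumed stand-in for a real relevance score.
    """
    task_words = set((task.instruction + " " + task.context).lower().split())

    def overlap(tool):
        return len(task_words & set(tool.description.lower().split()))

    # sorted() is stable, so tools with equal scores keep their original order
    ranked = sorted(task.allowed_actions, key=overlap, reverse=True)
    return ranked[:k]


# made-up task and tools, for illustration only
task = Task(
    instruction="cancel the flight booking for the customer",
    context="airline support desk",
    allowed_actions=(
        Tool("search_web", "search public web pages"),
        Tool("cancel_booking", "cancel an existing flight booking"),
        Tool("lookup_customer", "find a customer record by name"),
        Tool("run_tests", "run repository unit tests"),
    ),
)

print([tool.name for tool in shortlist_tools(task, k=2)])
```

a production system would more likely score relevance with embeddings or a model call; the point here is only that the agent sees a smaller, more relevant action set before it starts planning.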

failed runs cost 20 to 54 percent more than successful ones, so how an agent behaves when it fails matters for production costs. open-weight models such as deepseek v3.2 and kimi k2.5 have been added; they trail closed-source models by 18 to 29 percentage points on average. the project releases the leaderboard, the exgentic framework for running evaluations, and a paper with the full methodology. the team invites contributions of new agents, benchmarks, and models. the goal is a shared standard for evaluating open agent systems.
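to see why the failure premium matters, the sketch below works through the arithmetic for the two ends of the reported 20 to 54 percent range. the 50 percent success rate and the $1.00 cost of a successful run are assumed values picked for the example, and the function names are hypothetical.

```python
def mean_cost_per_task(success_rate, success_cost, failure_premium):
    """Average spend per attempted task when a failed run costs
    (1 + failure_premium) times as much as a successful one."""
    failure_cost = success_cost * (1.0 + failure_premium)
    return success_rate * success_cost + (1.0 - success_rate) * failure_cost


def cost_per_solved_task(success_rate, success_cost, failure_premium):
    """Average spend per task that actually succeeds."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    total = mean_cost_per_task(success_rate, success_cost, failure_premium)
    return total / success_rate


# assumed inputs: 50% success rate, $1.00 per successful run
for premium in (0.20, 0.54):
    per_task = mean_cost_per_task(0.5, 1.00, premium)
    per_solved = cost_per_solved_task(0.5, 1.00, premium)
    print(f"premium={premium:.0%}: ${per_task:.2f} per task, "
          f"${per_solved:.2f} per solved task")
```

under these assumptions, the gap between cost per task and cost per solved task widens as the success rate drops, which is why a low-scoring agent can be more expensive in practice than its per-run price suggests.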

why it matters: it helps ai practitioners choose cost-effective agent setups by showing how system design, not just the model, affects performance across varied tasks.

