source: hugging face blog: itbench-aa: frontier models score below 50% on the first benchmark for agentic enterprise it tasks — by artificial analysis and ibm

level: technical

artificial analysis and ibm launched itbench-aa, a benchmark for agentic enterprise it tasks, starting with site reliability engineering (sre). the sre tasks require models to diagnose kubernetes incidents by reading logs, traces, and topology to identify the root-cause entities. the dataset includes 59 tasks (40 public, 19 held-out) covering faults such as resource exhaustion and network partitions.

claude opus 4.7 scored highest at 47%, followed by gpt-5.5 at 46% and qwen3.7 max at 42%. all frontier models scored below 50%, making this one of the least saturated agentic benchmarks. more turns did not improve accuracy: gpt-5.5 averaged 31 turns and scored 46%, while gemini 3.1 pro preview averaged 83 turns and scored 30%. models that over-investigate often report upstream mechanisms or co-occurring symptoms as root causes; these count as false positives and lower the score.

open-weights models showed strong cost efficiency. gemma 4 31b scored 37% at $0.14 per task, outperforming gemini 3.1 pro preview, which scored 30% at $2.23 per task. glm-5.1 scored 40% at $1.23 per task, matching gemini 3.5 flash on score at lower cost. claude opus 4.7 led at 47% but cost $5.38 per task. scoring uses recall-gated precision: a prediction that misses any root-cause entity scores zero, and each extra entity reported lowers precision.
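
a minimal sketch of recall-gated precision as described above, assuming predictions and ground truth are compared as plain sets of entity names. the function name and the entity names in the usage lines are hypothetical, and the benchmark's actual scorer may normalize or match entities differently.

```python
def recall_gated_precision(predicted, ground_truth):
    """Return precision if every root cause was found, otherwise 0.0.

    Assumption: entities are compared by exact string match.
    """
    predicted = set(predicted)
    ground_truth = set(ground_truth)
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one entity")

    # recall gate: missing any root-cause entity yields zero
    if not ground_truth.issubset(predicted):
        return 0.0

    # recall is 1 here, so true positives == len(ground_truth);
    # every extra predicted entity is a false positive
    return len(ground_truth) / len(predicted)


# hypothetical entity names, for illustration only
truth = {"checkout-service"}
exact = recall_gated_precision({"checkout-service"}, truth)
over_investigated = recall_gated_precision(
    {"checkout-service", "redis-cache"}, truth
)
missed = recall_gated_precision({"redis-cache"}, truth)
```

under these assumptions, an exact answer scores 1.0, the same answer with one extra entity scores 0.5, and an answer that misses the root cause scores 0.0. this is consistent with the observation that over-investigating models lose points by reporting upstream mechanisms or co-occurring symptoms.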

why it matters: itbench-aa reveals that current ai agents struggle with real enterprise it diagnosis, highlighting gaps in reliability and cost-effectiveness for production use.

