source: hugging face blog: eva-bench data 2.0: 3 domains, 121 tools, 213 scenarios
level: technical
eva-bench data 2.0 expands from one enterprise domain to three: airline customer service management, enterprise it service management, and healthcare hr service delivery. the release includes 213 evaluation scenarios across 121 tools, a fourfold increase in coverage. each scenario was validated for solvability against three frontier models to ensure the benchmark is challenging and fair. the datasets are open-source and available on hugging face.
scenarios are built around five design principles: voice-first scope, realism, variety, authentication, and reproducibility. they include single-intent calls, multi-intent calls with up to four intents, and adversarial calls where callers try to bypass policies. authentication flows are calibrated to each domain, and every scenario has exactly one correct resolution path to ensure consistent evaluation. user goals are structured as decision trees to make the simulator behave identically across runs.
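to illustrate why a decision-tree goal keeps the simulator reproducible, here is a minimal python sketch. the `GoalNode` class, the outcome labels, and the airline dialogue are all hypothetical, invented for this example; the actual eva-bench goal schema is not described in this summary.

```python
from dataclasses import dataclass, field


@dataclass
class GoalNode:
    """One step in a hypothetical user-goal decision tree."""
    say: str  # what the simulated caller says at this node
    # maps an agent outcome label to the caller's next node
    branches: dict = field(default_factory=dict)


def simulate(root, agent_outcomes):
    """Walk the tree: the same outcomes always yield the same utterances."""
    node = root
    transcript = [root.say]
    for outcome in agent_outcomes:
        if outcome not in node.branches:
            break  # leaf reached, or outcome not covered by this goal
        node = node.branches[outcome]
        transcript.append(node.say)
    return transcript


# Hypothetical single-intent airline goal (all values are made up).
goal = GoalNode(
    say="i need to change my flight",
    branches={
        "asks_for_identity": GoalNode(
            say="my confirmation code is ABC123",
            branches={
                "offers_paid_change": GoalNode(say="yes, please go ahead"),
                "offers_refund_only": GoalNode(say="no thanks, i'll keep my booking"),
            },
        ),
    },
)

run_a = simulate(goal, ["asks_for_identity", "offers_paid_change"])
run_b = simulate(goal, ["asks_for_identity", "offers_paid_change"])
```

because every caller reply is fixed by the branch taken, two runs that see the same agent behavior produce the same transcript, which is the property the reproducibility principle asks for.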
scenarios are generated jointly using sygra, a graph-based pipeline, with gpt-5.4 as the backbone. each scenario consists of a user goal, an initial database state, and an expected final database state, all produced together to avoid inconsistencies. a multi-stage validation loop checks structural integrity, cross-reference consistency, and policy compliance. manual review and frontier model testing further filtered out ambiguous or unsolvable cases. a multilingual extension is in development to support languages beyond english.
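the scenario triple and two of the validation checks can be sketched as follows. this is an assumption-laden illustration, not the sygra pipeline: the `Scenario` class, the table layout, and the `foreign_keys` mapping are hypothetical, and policy compliance is left out because it depends on domain rules not given here.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical container for the three jointly generated parts."""
    user_goal: str
    initial_db: dict   # table name -> {record id -> record dict}
    expected_db: dict  # same layout, state after a correct resolution


def validate(scenario, foreign_keys):
    """Return a list of problems; an empty list means the checks passed.

    foreign_keys maps (table, field) -> referenced table (assumed format).
    """
    problems = []
    # Structural integrity: the goal must not be blank.
    if not scenario.user_goal.strip():
        problems.append("empty user goal")
    # Cross-reference consistency: every reference must resolve in both states.
    states = (("initial", scenario.initial_db), ("expected", scenario.expected_db))
    for label, db in states:
        for (table, fld), target in foreign_keys.items():
            for rec_id, rec in db.get(table, {}).items():
                if rec.get(fld) not in db.get(target, {}):
                    problems.append(f"{label}: {table}[{rec_id}].{fld} has no match in {target}")
    return problems


def resolved(scenario, final_db):
    """A run counts as correct only if the final state matches exactly."""
    return final_db == scenario.expected_db


# Made-up example data.
customers = {"C1": {"name": "ana"}}
scenario = Scenario(
    user_goal="cancel booking B1",
    initial_db={"customers": customers,
                "bookings": {"B1": {"customer_id": "C1", "status": "active"}}},
    expected_db={"customers": customers,
                 "bookings": {"B1": {"customer_id": "C1", "status": "cancelled"}}},
)
FOREIGN_KEYS = {("bookings", "customer_id"): "customers"}
problems = validate(scenario, FOREIGN_KEYS)
```

comparing the agent's final database against a single expected state is what makes "exactly one correct resolution path" checkable by a simple equality test.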
why it matters: eva-bench data 2.0 gives developers a reproducible, domain-diverse benchmark for testing voice agents on realistic enterprise tasks, helping them find failure points such as authentication and policy handling before deployment.