macarena benchmark tests ai agents on macos guis

source: arxiv machine learning: macarena: benchmarking computer use agents on an online macos environment

level: research

computer-use agents, which control graphical interfaces through vision and actions, have improved quickly, partly thanks to standardized benchmarks like osworld. macos, however, has been largely ignored. the only prior benchmark, macosworld, covers only a few first-party apps with simple tasks and runs on x86 virtual machines that do not work on apple silicon. macarena fills this gap with 421 manually checked tasks across 50 applications, combining adapted osworld tasks, content from macosworld, and 49 new macos-specific tasks. it runs natively on apple silicon using apple's virtualization framework.

macarena includes tasks that require understanding macos-specific interface elements such as the menu bar, dock, and mission control, which are absent from linux-based benchmarks. the benchmark also features multi-step workflows, file system interactions, and both first-party and third-party apps. tasks are verified to be solvable and unambiguous. the environment supports headless execution and reports evaluation metrics based on task completion and efficiency.
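
to make the idea of completion and efficiency metrics concrete, here is a minimal python sketch of how per-run results could be aggregated. it is not the benchmark's actual scoring code: the `TaskResult` record, the `summarize` helper, and the choice of step count as the efficiency measure are all assumptions made for illustration, and the sample data is invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """one agent run on one task (hypothetical record layout)."""
    app: str        # application the task targets
    success: bool   # whether the task checker passed
    steps: int      # number of actions the agent took (assumed efficiency proxy)


def summarize(results):
    """return (overall success rate, {app: (success rate, mean steps on successes)})."""
    if not results:
        raise ValueError("no results to summarize")

    # group runs by application so per-app variation is visible
    by_app = defaultdict(list)
    for r in results:
        by_app[r.app].append(r)

    per_app = {}
    for app, runs in by_app.items():
        wins = [r for r in runs if r.success]
        rate = len(wins) / len(runs)
        # efficiency is only defined when at least one run succeeded
        mean_steps = sum(r.steps for r in wins) / len(wins) if wins else None
        per_app[app] = (rate, mean_steps)

    overall = sum(r.success for r in results) / len(results)
    return overall, per_app


if __name__ == "__main__":
    # invented sample data, not results from the paper
    demo = [
        TaskResult("textedit", True, 4),
        TaskResult("textedit", True, 6),
        TaskResult("imovie", False, 15),
        TaskResult("imovie", True, 10),
    ]
    overall, per_app = summarize(demo)
    print(f"overall success rate: {overall:.2f}")
    for app, (rate, mean_steps) in sorted(per_app.items()):
        print(f"{app}: success={rate:.2f}, mean steps on success={mean_steps}")
```

averaging steps only over successful runs keeps a failed run that gives up early from looking efficient. a real harness might instead cap steps per task or normalize against a reference solution.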

initial tests with state-of-the-art agents show that macos presents unique challenges. agents struggled with tasks that require precise mouse control, reasoning about overlapping windows, or keyboard shortcuts. performance varied widely across applications: simpler tasks such as text editing were easier than complex ones such as video editing. the benchmark shows that current agents remain far from human-level performance on macos, highlighting the need for better spatial reasoning and platform-specific training.

why it matters: it provides a realistic testbed for developing ai assistants that can operate macos software, which is widely used in creative and professional settings.
