behaviorbench tests ai on real user decisions

source: arxiv artificial intelligence: behaviorbench: modeling real-world user decisions from behavioral traces

level: research

behaviorbench is a benchmark built from real-world behavioral traces. it uses public prediction-market records and on-chain data to reconstruct wallet-level decision histories. the benchmark has two task layers: belief prediction and trade prediction. belief prediction asks models to predict a user's final stance and confidence in a market. trade prediction asks models to predict the direction and amount of individual transactions. the data covers 2,000 evaluation wallets, with 141,445 belief instances and 1,485,972 trade instances.
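
to make the two task layers concrete, here is a minimal python sketch of what one instance of each might look like, plus the per-wallet averages implied by the reported counts. the class names, field names, and label values are hypothetical; the source does not describe the benchmark's actual schema.

```python
from dataclasses import dataclass

# all class and field names are hypothetical illustrations,
# not the benchmark's published schema.

@dataclass(frozen=True)
class BeliefInstance:
    wallet: str        # wallet identifier
    market: str        # prediction-market identifier
    stance: str        # final stance, e.g. "yes" or "no" (assumed labels)
    confidence: float  # assumed here to lie in [0, 1]

@dataclass(frozen=True)
class TradeInstance:
    wallet: str        # wallet identifier
    market: str        # prediction-market identifier
    direction: str     # e.g. "buy" or "sell" (assumed labels)
    amount: float      # trade size; units are not stated in the source

# counts as reported in the summary
N_WALLETS = 2_000
N_BELIEF = 141_445
N_TRADE = 1_485_972

def per_wallet_average(total: int, wallets: int = N_WALLETS) -> float:
    """mean number of instances per evaluation wallet."""
    return total / wallets

belief_avg = per_wallet_average(N_BELIEF)
trade_avg = per_wallet_average(N_TRADE)
print(f"belief instances per wallet: {belief_avg:.1f}")
print(f"trade instances per wallet: {trade_avg:.1f}")
```

on these counts, an average wallet contributes roughly 71 belief instances and roughly 743 trade instances, so the trade layer is about ten times larger than the belief layer.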

existing benchmarks for user understanding often rely on simulated users or model-generated behavior. recent work warns that model-based simulations can differ systematically from human behavior. behaviorbench avoids this by using observed public records. this makes it a harder, more realistic test of personalized decision modeling. the benchmark is designed for settings where systems must adapt to individual users, such as decision-support tools.

the two layers are complementary and test different aspects of user modeling. belief prediction focuses on inferring a user's final position from their history. trade prediction focuses on forecasting specific actions. together, they evaluate how well a model captures individual decision patterns at both the outcome level and the action level. the benchmark's scale and real-world origin are intended to advance research in personalized ai.
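
as an illustration of how trade predictions could be scored, the sketch below compares predicted and observed trades on direction and amount. this is an assumption for illustration only: the source does not state which metrics behaviorbench uses, and the `Trade` representation and both function names are hypothetical.

```python
from typing import Sequence, Tuple

# hypothetical representation of one trade: (direction, amount)
Trade = Tuple[str, float]

def _check_lengths(predicted: Sequence[Trade], observed: Sequence[Trade]) -> None:
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("at least one trade is required")

def direction_accuracy(predicted: Sequence[Trade], observed: Sequence[Trade]) -> float:
    """fraction of trades whose predicted direction matches the observed one."""
    _check_lengths(predicted, observed)
    hits = sum(p[0] == o[0] for p, o in zip(predicted, observed))
    return hits / len(observed)

def mean_absolute_amount_error(predicted: Sequence[Trade], observed: Sequence[Trade]) -> float:
    """mean absolute gap between predicted and observed trade amounts."""
    _check_lengths(predicted, observed)
    total = sum(abs(p[1] - o[1]) for p, o in zip(predicted, observed))
    return total / len(observed)

# toy data, not drawn from the benchmark
predicted = [("buy", 10.0), ("sell", 5.0)]
observed = [("buy", 12.0), ("buy", 5.0)]
accuracy = direction_accuracy(predicted, observed)
amount_error = mean_absolute_amount_error(predicted, observed)
```

scoring direction and amount separately keeps a model that gets the side of a trade right but the size wrong distinguishable from one that fails on both.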

why it matters: it provides a realistic testbed for ai systems that need to understand and predict individual human decisions, moving beyond simulated data.
