source: arxiv artificial intelligence: can ai agents synthesize scientific conclusions?
level: research
scientific ai agents are increasingly used to gather evidence and form conclusions for important decisions, but their reliability in fields like health is uncertain. researchers created sciconbench, a benchmark with 9,110 questions and expert-written conclusions from systematic reviews, to test how well agents synthesize open-domain science. they also built sciconharness, a clean-room evaluation tool that gives agents controlled web access to prevent data leakage and ensure fair measurement.
the team evaluated eight frontier models and deep research agents using an automated pipeline that breaks each conclusion into atomic facts and scores factual precision and recall. under clean-room conditions, the best agent reached a factual f1 score of only 0.33, showing that even advanced systems produce many incorrect or incomplete claims. performance varied across models, but none came close to human expert quality, highlighting a significant gap in current ai capabilities.
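to make the scoring concrete, here is a minimal sketch of how a factual f1 could be computed once conclusions have been split into atomic facts. it is an illustration under stated assumptions, not the paper's pipeline: the names `FactualScore` and `factual_scores` are hypothetical, precision is assumed to be the share of the agent's atomic facts supported by the expert conclusion, recall is assumed to be the share of the expert's atomic facts covered by the agent, and the true/false judgments are hand-written placeholders rather than real benchmark data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FactualScore:
    precision: float
    recall: float
    f1: float


def factual_scores(
    predicted_supported: Sequence[bool],
    reference_covered: Sequence[bool],
) -> FactualScore:
    """Score one conclusion from per-fact judgments (hypothetical helper).

    predicted_supported: one flag per atomic fact in the agent's conclusion,
        True if the expert conclusion supports that fact.
    reference_covered: one flag per atomic fact in the expert conclusion,
        True if the agent's conclusion covers that fact.
    """
    # Precision: fraction of the agent's facts that are supported.
    precision = (
        sum(predicted_supported) / len(predicted_supported)
        if predicted_supported
        else 0.0
    )
    # Recall: fraction of the expert's facts that the agent covered.
    recall = (
        sum(reference_covered) / len(reference_covered)
        if reference_covered
        else 0.0
    )
    # F1: harmonic mean of precision and recall, 0 when both are 0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return FactualScore(precision=precision, recall=recall, f1=f1)


if __name__ == "__main__":
    # Made-up judgments: 1 of 4 agent facts supported, 2 of 4 expert facts covered.
    score = factual_scores(
        predicted_supported=[True, False, False, False],
        reference_covered=[True, True, False, False],
    )
    print(f"precision={score.precision:.2f} recall={score.recall:.2f} f1={score.f1:.2f}")
```

because f1 is a harmonic mean, it is pulled toward the lower of the two values, so an agent cannot score well by covering every expert fact while adding many unsupported ones, or by stating a few safe facts while omitting the rest.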
the benchmark covers diverse scientific topics and uses live, up-to-date sources, making it a rigorous test of real-world synthesis. the low scores indicate that current agents often miss key details or introduce errors when combining information from multiple studies. this work provides a standardized way to track progress and identify weaknesses in scientific reasoning, pushing for more trustworthy ai in evidence-based fields.
why it matters: poor factual synthesis by ai agents can lead to flawed decisions in healthcare and policy, so this benchmark helps developers build more accurate and reliable systems.