source: arxiv machine learning: teaching language models to forecast research success through comparative idea evaluation

level: research

as language models speed up science by generating hypotheses, a new problem arises: how to pick the best ideas from hundreds without testing each one. researchers asked whether language models can forecast the success of research ideas before any experiments. they built a dataset of 11,488 idea pairs from paperswithcode, where each pair has a known winner based on benchmark results. the task is to predict which idea will perform better.
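
a minimal sketch of how such a pairwise task might be represented and scored. the field names (`idea_a`, `idea_b`, `winner`), the `accuracy` helper, and the toy examples are illustrative assumptions, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class IdeaPair:
    """one comparison: two idea descriptions plus the known winner."""
    idea_a: str
    idea_b: str
    winner: str  # "a" or "b" (hypothetical encoding of the benchmark outcome)


def accuracy(pairs: Sequence[IdeaPair], predict: Callable[[IdeaPair], str]) -> float:
    """fraction of pairs where the predicted winner matches the known winner."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for pair in pairs if predict(pair) == pair.winner)
    return correct / len(pairs)


# toy data, invented purely for illustration
pairs = [
    IdeaPair("method with data augmentation", "baseline method", "a"),
    IdeaPair("smaller model", "larger model with pretraining", "b"),
    IdeaPair("new loss function", "standard loss", "a"),
    IdeaPair("no regularization", "with dropout", "b"),
]


def always_a(pair: IdeaPair) -> str:
    """trivial predictor that always picks the first idea."""
    return "a"


print(accuracy(pairs, always_a))
```

on a balanced set like this one, a predictor that ignores the ideas entirely scores 0.5, which is the baseline any real forecaster has to beat.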

off-the-shelf 8-billion-parameter models reached only 30% accuracy, below the 50% that random guessing would score on a two-way choice. supervised fine-tuning on the dataset raised accuracy to 77.1%, beating gpt-5's 61.1%. the team also used reinforcement learning with verifiable rewards to make the model reason through its choices. this approach reached 71.35% accuracy while producing interpretable justifications, showing that models can learn to evaluate ideas without being trained on explicit step-by-step reasoning examples.
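
a verifiable reward is one that can be computed by checking the model's final answer against a known outcome. a hedged sketch of what that could look like for this task follows; the `answer: a` / `answer: b` output format, the regex, and the 0/1 reward values are assumptions for illustration, not the paper's actual setup.

```python
import re

# assumed output convention: the model ends its reasoning with "answer: a" or "answer: b"
ANSWER_PATTERN = re.compile(r"answer:\s*([ab])\b", re.IGNORECASE)


def verifiable_reward(model_output: str, known_winner: str) -> float:
    """return 1.0 if the last stated answer matches the known winner, else 0.0.

    outputs with no parseable answer get 0.0, so the model is not
    rewarded for reasoning that never commits to a choice.
    """
    matches = ANSWER_PATTERN.findall(model_output)
    if not matches:
        return 0.0
    final_answer = matches[-1].lower()
    return 1.0 if final_answer == known_winner.lower() else 0.0


print(verifiable_reward("idea a adds pretraining data. answer: a", "a"))
print(verifiable_reward("both look similar. answer: b", "a"))
```

only the final pick is scored here, so the reasoning text before it is free-form; that is what lets the justifications emerge without labeled reasoning examples.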

the work turns idea evaluation into a comparative reasoning task. by framing it as a prediction problem with clear outcomes, the model learns to spot patterns that lead to better results. the dataset and methods could help automate the filtering of ai-generated research directions, saving time and compute. the gap of nearly six points between the fine-tuned model (77.1%) and the reasoning-based model (71.35%) suggests there is room to improve how models explain their forecasts.

why it matters: this could help researchers and automated systems quickly filter promising ai-generated ideas, reducing wasted experiments and speeding up scientific discovery.

