level: research
large language models are getting better at math, but most benchmarks only cover well-defined problems with final answers or complete solutions. they miss how people actually solve new problems together. in real research, people share partial ideas, find mistakes, repair broken arguments, and gradually combine small steps into a complete proof. this kind of collaboration is common in open problem solving but is not reflected in current benchmarks.
the crowdmath dataset fills this gap. it contains 164 expert-labeled progress chains from the mit primes–art of problem solving crowdmath program, which ran from 2016 to 2025. each chain follows a forum discussion where multiple people work on an open problem until they reach a full proof. every post is tagged with its role in the solution process, such as making partial progress, completing a proof, or pointing out errors. these discussions have led to published research papers.
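as a rough illustration of the structure described above, a progress chain can be modeled as an ordered list of posts, each carrying a role tag. this is only a sketch under stated assumptions: the class names, field names, and the three role labels are hypothetical (taken from the examples in the text), and the toy posts are made up. the dataset's actual schema and label set may differ.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # hypothetical labels, based only on the roles named in the text;
    # the real dataset may use different or additional tags
    PARTIAL_PROGRESS = "partial_progress"
    PROOF_COMPLETION = "proof_completion"
    ERROR_IDENTIFICATION = "error_identification"


@dataclass
class Post:
    author: str
    text: str
    role: Role


@dataclass
class ProgressChain:
    problem: str
    posts: list[Post] = field(default_factory=list)

    def role_counts(self) -> Counter:
        """count how many posts carry each role tag."""
        return Counter(post.role for post in self.posts)

    def is_complete(self) -> bool:
        """a chain counts as complete once some post finishes the proof."""
        return any(post.role is Role.PROOF_COMPLETION for post in self.posts)


# toy chain with invented posts, for illustration only
chain = ProgressChain(
    problem="placeholder open problem",
    posts=[
        Post("user_a", "a lemma that handles the base case", Role.PARTIAL_PROGRESS),
        Post("user_b", "the lemma fails when n is even", Role.ERROR_IDENTIFICATION),
        Post("user_a", "fixed lemma, now covering even n", Role.PARTIAL_PROGRESS),
        Post("user_c", "combining both lemmas gives the full proof", Role.PROOF_COMPLETION),
    ],
)

counts = chain.role_counts()
complete = chain.is_complete()
```

keeping the posts in order matters here: a model evaluated on such a chain would need to see which partial results and corrections came before each new post.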
by showing the full back-and-forth of collaborative reasoning, crowdmath offers a new way to test and train ai models. it moves beyond clean final answers to the real, messy path of discovery. the dataset can help improve how models handle incomplete ideas, spot mistakes, and build on others' work. this is a step toward ai that can take part in genuine research teamwork.
why it matters: crowdmath provides a realistic benchmark for training and testing ai models on collaborative, open-ended problem solving, which is closer to how real math research works.