most llm reasoning steps are unnecessary, study finds

source: arxiv artificial intelligence: how much thinking is enough? quantifying and understanding redundancy in llm reasoning

level: research

researchers measured reasoning redundancy in four frontier large language models across two math benchmarks. they defined redundancy as the largest fraction of trailing reasoning steps that can be cut while the model, forced to stop thinking and answer immediately, still gets the correct result. across the eight model-benchmark pairs, step-level redundancy ranged from 61% to 93%, with medians often above 80%. in other words, most of the generated reasoning could be removed without changing the final answer.

the team formalized redundancy using the model's own behavior. they truncated chain-of-thought traces from the end, segment by segment, and checked if the forced final answer matched the original correct answer. the high redundancy held even when controlling for problem difficulty and model size. the findings suggest that current reasoning models spend significant computation on reformulation, verification, and circular self-reflection that does not improve outcomes.
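the truncation procedure can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the authors' released code: `step_redundancy`, `forced_answer`, and the toy trace are hypothetical names, and `forced_answer` stands in for a real model call that stops the reasoning and forces a final answer. the sketch also assumes the largest successful cut is what counts, without requiring every smaller cut to succeed as well.

```python
from typing import Callable, Sequence


def step_redundancy(
    steps: Sequence[str],
    forced_answer: Callable[[Sequence[str]], str],
    correct: str,
) -> float:
    """Largest fraction of trailing steps that can be cut while the
    forced answer still matches `correct`.

    `forced_answer` is a hypothetical stand-in for a model call that
    receives a truncated trace and must answer immediately.
    """
    n = len(steps)
    if n == 0:
        return 0.0
    # Try the largest cut first (keep 0 steps), then keep one more step
    # at a time until the forced answer is correct.
    for kept in range(n + 1):
        if forced_answer(steps[:kept]) == correct:
            return (n - kept) / n
    # Even the full trace did not yield the correct answer.
    return 0.0


# Toy stand-in for a model: it "knows" the answer only once a step
# containing "= 42" appears in the truncated trace. (Assumption for
# illustration only; a real setup would query the model.)
def toy_forced_answer(prefix: Sequence[str]) -> str:
    return "42" if any("= 42" in s for s in prefix) else "?"


toy_steps = [
    "let x be the unknown",
    "so x = 42",
    "check: 42 works",
    "wait, re-derive",
    "still 42",
]

redundancy = step_redundancy(toy_steps, toy_forced_answer, "42")
print(f"step-level redundancy: {redundancy:.0%}")
```

in the toy trace, the answer is already reachable after the second of five steps, so the last three steps count as redundant. a real measurement would replace `toy_forced_answer` with a model query and average the result over many problems.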

the work provides a first-principles framework to quantify wasted computation in reasoning models. it opens paths for more efficient inference by identifying when to stop thinking early. the authors argue that understanding redundancy can guide training and decoding strategies to reduce latency, gpu time, and energy without hurting performance. the paper includes detailed analysis across models and benchmarks, with code and data released for further study.

why it matters: cutting redundant reasoning steps could slash inference costs and latency for ai systems without losing accuracy, making advanced models cheaper and faster to run.
