llm reviews show limited alignment with human judgment

source: arxiv artificial intelligence: review arcade: on the human alignment and gameability of llm reviews

level: research

researchers tested how well reviews written by large language models match human reviews of scientific papers, using papers from the 2025 acl rolling review process. alignment between llm and human reviews was limited: reasonable in the best case, but highly sensitive to the prompt and the model used. that instability means llm reviews are not yet a reliable substitute for human reviewers.
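as a rough illustration of what "alignment" can mean in practice, the sketch below compares human and llm scores for the same papers with a spearman rank correlation, once per prompt. the choice of metric is an assumption (the summary does not say which measure the study used), and the scores and prompt names are made up purely to show the mechanics.

```python
from statistics import mean

def average_ranks(values):
    """rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """pearson correlation; assumes neither input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """spearman correlation: pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# hypothetical scores for five papers, for illustration only (not study data)
human_scores = [2.0, 3.5, 3.0, 4.0, 2.5]
llm_scores_by_prompt = {
    "prompt_a": [3.0, 3.5, 3.5, 4.0, 3.0],
    "prompt_b": [4.0, 3.0, 3.5, 3.5, 4.0],
}

for name, llm_scores in llm_scores_by_prompt.items():
    print(name, round(spearman(human_scores, llm_scores), 3))
```

running the same comparison per prompt and per model is one simple way to see how much the agreement figure moves when only the reviewing setup changes.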

the study also looked at whether authors could game llm reviews. the researchers simulated an author who uses an llm to revise a paper based on the llm reviewer's own feedback. this iterative draft-revise workflow raised the score the llm reviewer gave the paper, and the improvement was statistically significant in some settings. the result points to a risk: authors who use the same or similar models to polish their submissions can manipulate llm reviews.
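the shape of that draft-revise loop can be sketched as follows. this is a minimal toy, not the study's setup: `toy_reviewer`, `toy_reviser`, and `CHECKLIST` are hypothetical stand-ins for llm calls, and the score rises here by construction, so the sketch shows only the control flow, not the effect itself.

```python
from dataclasses import dataclass

# hypothetical criteria the toy reviewer looks for
CHECKLIST = ("baseline", "ablation", "limitations")

@dataclass
class Review:
    score: int
    complaints: list

def toy_reviewer(paper):
    """stand-in for an llm reviewer: base score 1, plus 1 per checklist term present."""
    missing = [term for term in CHECKLIST if term not in paper]
    return Review(score=1 + len(CHECKLIST) - len(missing), complaints=missing)

def toy_reviser(paper, review):
    """stand-in for an llm reviser: addresses only the first complaint."""
    if not review.complaints:
        return paper
    return paper + f" added: {review.complaints[0]}."

def draft_revise_loop(paper, reviewer, reviser, rounds):
    """review, revise on the feedback, re-review; return final paper and score history."""
    review = reviewer(paper)
    scores = [review.score]
    for _ in range(rounds):
        paper = reviser(paper, review)
        review = reviewer(paper)
        scores.append(review.score)
    return paper, scores

final_paper, scores = draft_revise_loop(
    "we propose a method.", toy_reviewer, toy_reviser, rounds=3
)
print(scores)  # [1, 2, 3, 4]
```

because the reviser sees exactly what the reviewer complained about, each round targets the scoring criteria directly. that is the gaming concern in miniature: the score can climb without any independent check that the paper actually improved.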

the findings raise concerns about using llm reviews in real conference settings. if reviews can be gamed, the peer review process may become less trustworthy. the variability across prompts and models also makes it hard to standardize llm reviewing. conference organizers need to be careful when piloting llm reviews. more work is needed to understand when and how llm reviews can be used safely without undermining the fairness of scientific evaluation.

why it matters: llm reviews are being tested at major ai conferences, but this study shows they can be unreliable and easy to manipulate, which could harm the integrity of peer review.
