source: arxiv statistics ml: aura: adaptive uncertainty-aware refinement for llm-as-a-judge auditing
level: research
large language models are often used as judges for open-ended generation because human evaluation is costly and hard to scale. however, their preferences are imperfect proxies for human judgment. existing auditing methods usually assume that a reliable subset of examples or clean supervision signals, such as human annotations or strong judge outputs, is available beforehand. this assumption is fragile in llm evaluation because the initial split may inherit judge bias, and human verification is too scarce to define stable groups at scale.
aura is an adaptive uncertainty-aware refinement framework for auditing pairwise llm-as-a-judge decisions with selective human verification. it iteratively learns a human-consistency signal, propagates reliable evidence, and prioritizes uncertain comparisons for human review. the core idea is to treat trust in a judge's decision as a learnable quantity that improves as more human feedback is collected. aura does not require a pre-existing clean split or a large human-labeled dataset.
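the loop described above can be illustrated with a minimal python sketch. this is not the paper's algorithm: it assumes a single hypothetical feature per comparison (for example, the judge's preference margin), models trust with a one-dimensional logistic regression, and uses plain uncertainty sampling (trust closest to 0.5) to choose the next comparison for human review. the names `fit_trust_model`, `audit`, and `ask_human` are invented for illustration, and evidence propagation is reduced to refitting the trust model after each new human label.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_trust_model(xs, ys, steps=500, lr=0.5):
    # Assumption: trust is modeled as sigmoid(w * x + b), fitted by
    # batch gradient descent on the human-verified comparisons only.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def audit(features, ask_human, budget, seed_size=4, rng=None):
    # Hypothetical audit loop. ask_human(i) returns 1 if a human agrees
    # with the judge on comparison i, else 0. Returns per-comparison
    # trust scores and the dict of human-verified labels.
    rng = rng or random.Random(0)
    n = len(features)
    # Small random seed set; no pre-existing clean split is assumed.
    verified = {i: ask_human(i) for i in rng.sample(range(n), seed_size)}
    trust = [0.5] * n
    while True:
        idx = sorted(verified)
        w, b = fit_trust_model([features[i] for i in idx],
                               [verified[i] for i in idx])
        trust = [sigmoid(w * x + b) for x in features]
        pending = [i for i in range(n) if i not in verified]
        if len(verified) >= budget or not pending:
            break
        # Send the comparison whose trust is closest to 0.5 to a human.
        pick = min(pending, key=lambda i: abs(trust[i] - 0.5))
        verified[pick] = ask_human(pick)
    # Human labels override the model wherever they exist.
    for i, y in verified.items():
        trust[i] = float(y)
    return trust, verified


if __name__ == "__main__":
    # Synthetic data: the judge is assumed reliable only above a margin of 0.4.
    features = [i / 39 for i in range(40)]
    truth = [1 if x > 0.4 else 0 for x in features]
    trust, verified = audit(features, lambda i: truth[i], budget=12)
    print(len(verified), "comparisons sent to human review")
```

in this sketch the human budget is spent near the model's decision boundary, which is where an additional label changes the trust estimate the most. a real system would replace the single feature with richer signals and would need a defined rule for what to do with low-trust judge decisions that were never verified.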
experiments on summarization and dialogue benchmarks show aura achieves higher agreement with human judgments than baselines while using less human verification. it effectively identifies and corrects biased judge decisions by focusing human effort on the most uncertain comparisons. the framework is model-agnostic and can work with different llm judges and tasks. this approach reduces the cost of reliable evaluation and helps align automated metrics with human preferences.
why it matters: it enables more reliable llm evaluation with limited human effort, reducing bias in automated judging.