source: arxiv statistics ml: prediction-powered inference across many tasks for ai evaluation & social science research

level: research

many applications need valid statistical inference across many related tasks but have only a few high-quality labels per hypothesis. in ai evaluation, tasks can be model behaviors across prompts or subgroups. in social science, they can be related survey questions or populations. prediction-powered inference (ppi) uses cheap proxy measurements to improve inference from limited ground-truth labels. standard ppi methods treat tasks independently and miss shared structure, which hurts when labels per task are very scarce.
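for concreteness, the single-task baseline can be sketched as follows. this is a minimal illustration of the classic ppi estimator of a mean with a normal-approximation confidence interval, not code from the paper. the function name `ppi_mean_ci` and its argument names are hypothetical.

```python
import numpy as np


def ppi_mean_ci(y_labeled, f_labeled, f_unlabeled, z=1.96):
    """Classic PPI estimate of a mean for ONE task (illustrative sketch).

    y_labeled   : ground-truth labels on the small labeled set (size n)
    f_labeled   : proxy predictions on that same labeled set (size n)
    f_unlabeled : proxy predictions on the large unlabeled set (size N)
    z           : normal quantile, 1.96 for an approximate 95% interval
    """
    y = np.asarray(y_labeled, dtype=float)
    f_lab = np.asarray(f_labeled, dtype=float)
    f_unl = np.asarray(f_unlabeled, dtype=float)
    n, N = len(y), len(f_unl)

    # The rectifier measures how wrong the proxy is on the labeled data.
    rectifier = y - f_lab

    # Proxy mean on the large set, corrected by the average proxy error.
    theta = f_unl.mean() + rectifier.mean()

    # The two terms come from independent samples, so their variances add.
    se = np.sqrt(f_unl.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    return theta, (theta - z * se, theta + z * se)
```

when n is very small, the rectifier variance term dominates the standard error. that is the regime where treating each task in isolation becomes costly.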

the paper introduces a multi-task ppi framework that borrows labeled data from related tasks to increase power while keeping inference task-specific. it combines proxy predictions with a small number of true labels per task and exploits correlations among tasks to reduce the variance of each task's estimate. the payoff is more precise estimates, and therefore stronger conclusions, from the same limited set of labels.
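one simple way to picture borrowing strength across tasks is sketched below. this is an assumption made for illustration and is not the paper's estimator. it uses the power-tuned form of ppi for a mean, in which a weight lambda controls how much the proxy is trusted, and it estimates a single lambda from the pooled labeled data of all tasks. each task's estimate still uses only that task's own labels and proxies. the names `pooled_lambda` and `task_estimate`, and the dict keys `y`, `f_lab`, and `f_unl`, are hypothetical.

```python
import numpy as np


def pooled_lambda(tasks):
    """Estimate one proxy weight from the labeled data of ALL tasks.

    tasks: list of dicts with keys "y", "f_lab", "f_unl" (array-likes).
    Illustrative assumption: the label/proxy relationship is similar
    enough across tasks that a shared weight is reasonable.
    """
    # Center within each task so that differences in task means do not
    # leak into the pooled covariance.
    y_c = np.concatenate(
        [np.asarray(t["y"], float) - np.mean(t["y"]) for t in tasks]
    )
    f_c = np.concatenate(
        [np.asarray(t["f_lab"], float) - np.mean(t["f_lab"]) for t in tasks]
    )
    n = sum(len(t["y"]) for t in tasks)      # total labeled points
    N = sum(len(t["f_unl"]) for t in tasks)  # total unlabeled points

    var_f = np.mean(f_c ** 2)
    if var_f == 0.0:
        return 0.0  # a constant proxy carries no information
    # Variance-minimizing weight for mean estimation, clipped to [0, 1].
    lam = np.mean(y_c * f_c) / ((1.0 + n / N) * var_f)
    return float(np.clip(lam, 0.0, 1.0))


def task_estimate(task, lam):
    """Task-specific estimate: label mean plus a weighted proxy correction."""
    y = np.asarray(task["y"], float)
    f_lab = np.asarray(task["f_lab"], float)
    f_unl = np.asarray(task["f_unl"], float)
    return y.mean() + lam * (f_unl.mean() - f_lab.mean())
```

for any fixed lambda the per-task estimate is unbiased, because the two proxy means cancel in expectation. estimating lambda from the same data adds some dependence that this sketch ignores. a usage pattern would be `lam = pooled_lambda(tasks)` followed by `[task_estimate(t, lam) for t in tasks]`.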

the method is relevant for ai evaluation where testing many hypotheses with expensive human labels is common. it also applies to social science surveys with many related questions but few respondents. by sharing information across tasks, the framework makes it possible to detect smaller effects or use fewer labels. the authors provide theoretical guarantees and demonstrate gains in simulations and real data.

why it matters: it lets ai evaluators and social scientists get more reliable results from fewer expensive labels by sharing information across related tasks.

