source: arXiv Statistics ML: Post-ADC Inference: Valid Inference after Active Data Collection
level: research
Statistical inference often fails when data comes from active data collection (ADC), where sampling adaptively concentrates on promising regions. This is the regime of black-box optimization methods such as the tree-structured Parzen estimator (TPE) and Gaussian process upper confidence bound (GP-UCB): these methods focus evaluations where the objective looks best, producing biased samples. When analysts later reuse that data for a different inferential task, standard methods give misleading results because they ignore the adaptive sampling process.
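A minimal simulation makes this concrete (the two-arm setup, greedy rule, and horizon are illustrative assumptions, not from the paper): when a sampler always pulls the arm with the highest current sample mean, unlucky early draws freeze an arm's low estimate in place, so the final empirical means are systematically biased downward even though both arms have true mean zero.

```python
import numpy as np

def greedy_bandit(rng, n_rounds=20, n_arms=2):
    """Greedily sample the arm with the highest current sample mean.
    Both arms have true mean 0, so any systematic deviation of the
    final per-arm sample means from 0 is pure selection bias."""
    sums = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    # one initial pull per arm
    for a in range(n_arms):
        sums[a] += rng.normal(0.0, 1.0)
        counts[a] += 1
    # remaining pulls go to whichever arm currently looks best
    for _ in range(n_rounds - n_arms):
        a = int(np.argmax(sums / counts))
        sums[a] += rng.normal(0.0, 1.0)
        counts[a] += 1
    return sums / counts  # per-arm sample means

rng = np.random.default_rng(0)
reps = 20000
bias = float(np.mean([greedy_bandit(rng).mean() for _ in range(reps)]))
print(f"average per-arm sample-mean bias: {bias:.3f}")  # clearly negative, not 0
```

The downward bias arises because the algorithm stops sampling an arm precisely when its estimate looks bad, so negative noise is never averaged away.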
The bias worsens when the inferential target itself is chosen after seeing the data. For example, a researcher might decide to estimate the mean performance of the best-performing configuration only after the optimization has run, a classic winner's-curse setting. This data-dependent target selection compounds the selection bias from active data collection. Conventional confidence intervals and hypothesis tests account for neither source of bias, so their coverage and error-rate guarantees break down.
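The target-selection bias can be isolated in a short Monte Carlo sketch (the configuration count and sample sizes are illustrative assumptions): among ten configurations that all have true mean zero, reporting the sample mean of whichever one looks best gives a substantially positive estimate on average.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, reps = 10, 20, 5000  # 10 configs, 20 evaluations each
selected_means = []
for _ in range(reps):
    # every configuration has true mean 0: the "best" one is an illusion
    data = rng.normal(0.0, 1.0, size=(K, n))
    means = data.mean(axis=1)
    # naive estimate: the sample mean of the configuration chosen post hoc
    selected_means.append(means.max())
winner_bias = float(np.mean(selected_means))  # true value is 0
print(f"bias of the selected configuration's sample mean: {winner_bias:.3f}")
```

A naive confidence interval centered at this estimate would be shifted upward by roughly the same amount, so its nominal coverage fails badly.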
The proposed post-ADC inference framework corrects for both the adaptive data collection and the data-driven choice of target. By modeling the entire data-generating process, including the sampling policy, it delivers confidence intervals and p-values with valid guarantees. Practitioners can therefore draw conclusions from data originally collected for optimization without gathering fresh independent samples. The method requires only mild assumptions and applies to common sequential model-based optimization algorithms.
Why it matters: it lets data scientists reuse optimization data for reliable inference, saving time and resources while avoiding false conclusions.