source: arxiv statistics ml: kernel selection is model selection: a unified complexity-penalized approach for mmd two-sample tests

level: research

the maximum mean discrepancy (mmd) is widely used for nonparametric two-sample testing, but its power depends heavily on the choice of kernel. no single fixed kernel is sensitive to every kind of distribution difference, which motivates data-driven kernel optimization. however, optimizing the kernel on the same data used for the test makes the selected kernel depend on the samples, which breaks the i.i.d. assumption behind standard analyses and creates a trade-off between adaptivity and validity. ratio-based criteria ignore this dependence, leading to overfitting and variance collapse over rich kernel classes. aggregation methods avoid the dependence by restricting the search to finite grids, but they do not scale to continuous search spaces such as deep kernels.
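to make the kernel dependence concrete, here is a minimal sketch of the standard unbiased mmd^2 estimate with a gaussian kernel, evaluated at several bandwidths. the function names, sample sizes, and bandwidth values are illustrative assumptions, not taken from the paper, and this is the plain fixed-kernel statistic rather than cp-mmd.

```python
import numpy as np

def gaussian_gram(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    # Squared Euclidean distance between every row of a and every row of b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth):
    """Unbiased estimate of squared MMD between samples x and y."""
    n, m = len(x), len(y)
    kxx = gaussian_gram(x, x, bandwidth)
    kyy = gaussian_gram(y, y, bandwidth)
    kxy = gaussian_gram(x, y, bandwidth)
    # Within-sample terms exclude the diagonal (i == j) to remove bias.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()

# Hypothetical data: two 2-d Gaussians whose means differ by 0.5 per coordinate.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(0.5, 1.0, size=(200, 2))

# The same pair of samples gives very different statistics per bandwidth.
for bw in (0.01, 1.0, 100.0):
    print(f"bandwidth={bw:g}  mmd2={mmd2_unbiased(x, y, bw):.6f}")
```

on this toy example a bandwidth near the data scale separates the samples far better than a very small or very large one, which is why a poorly chosen fixed kernel can cost power.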

the paper frames data-driven kernel selection as a model selection problem. it introduces complexity-penalized mmd (cp-mmd), which applies a two-sample uniform concentration inequality to the post-optimization mmd statistic. the resulting penalty accounts for the complexity of the kernel class, preventing overfitting without needing a separate data split. the method works for any kernel class with a measurable cover, including deep kernels, and provides a principled way to balance fit and complexity.
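as a schematic illustration of why a uniform bound handles the post-optimization statistic: the notation below, including the deviation term \varepsilon and the kernel class \mathcal{K}, is an assumption made for exposition and is not the paper's exact statement or penalty.

```latex
% Schematic only: \varepsilon, \mathcal{K}, and \hat{k} are illustrative notation.
% A uniform (over the kernel class) deviation bound at confidence level 1 - \delta:
\Pr\!\left( \sup_{k \in \mathcal{K}}
  \left| \widehat{\mathrm{MMD}}_k(X, Y) - \mathrm{MMD}_k(P, Q) \right|
  \le \varepsilon(\mathcal{K}, n, m, \delta) \right) \ge 1 - \delta .

% Because the bound holds for every k simultaneously, it also holds for the
% data-selected kernel \hat{k} \in \mathcal{K}. Under the null P = Q we have
% \mathrm{MMD}_k(P, Q) = 0 for all k, so
\Pr_{P = Q}\!\left( \widehat{\mathrm{MMD}}_{\hat{k}}(X, Y)
  > \varepsilon(\mathcal{K}, n, m, \delta) \right) \le \delta .
```

in this generic form, a richer kernel class gives a larger \varepsilon, which is the sense in which the penalty trades fit against complexity.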

cp-mmd unifies kernel selection and model selection, offering a scalable approach to modern two-sample testing. it removes the need for restrictive grids or separate validation sets, making it practical for high-dimensional and complex data. its theoretical guarantees keep the test valid even after data-driven kernel optimization.

why it matters: it enables reliable, automatic kernel choice for two-sample tests, crucial for comparing datasets in ai model evaluation, fairness auditing, and scientific analysis.

