The Open ASR Leaderboard now includes private English speech datasets from Appen and DataoceanAI. These cover scripted and conversational speech with American, Australian, Canadian, Indian, and British accents. The data is kept private to prevent "benchmaxxing," where models are optimized for public test sets without real-world gains. By default, the average word error rate (WER) is still computed only on public data, but users can toggle the private sets on to see their impact.
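
To make the metric and the toggle concrete, here is a minimal sketch. It assumes WER is the word-level edit distance divided by the number of reference words, and that the leaderboard average is a plain mean over per-dataset scores; the source confirms neither detail. The dataset labels, the scores, and the `include_private` flag are hypothetical.

```python
from __future__ import annotations

from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def average_wer(scores: dict[str, float], private: set[str],
                include_private: bool = False) -> float:
    """Mean WER over public datasets, optionally including the private ones."""
    selected = [wer for name, wer in scores.items()
                if include_private or name not in private]
    return mean(selected)


# Hypothetical per-dataset WERs in percent; these are not leaderboard results.
scores = {"public_a": 8.0, "public_b": 12.0, "private_x": 16.0}
private = {"private_x"}

print(average_wer(scores, private))                        # 10.0
print(average_wer(scores, private, include_private=True))  # 12.0
```

In this made-up example, switching the private set on moves the average from 10.0 to 12.0, which is the kind of gap the toggle is meant to expose.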

The new private data tab shows metrics such as average WER, scripted vs. conversational performance, and US vs. non-US accents. Scores are not reported per split, to avoid targeted optimization. Using multiple data providers balances out any advantage a model might gain from having trained on data similar to one provider's. Because the default ranking excludes the private sets, they do not change the main leaderboard order unless users choose to include them.
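
The aggregate-only reporting can be sketched as follows. This is an illustration under stated assumptions, not the leaderboard's implementation: the `PrivateResult` record, the `summarize` helper, the unweighted means, and all numbers are hypothetical. The point is that per-split scores go in, but only grouped averages come out.

```python
from statistics import mean
from typing import NamedTuple


class PrivateResult(NamedTuple):
    style: str   # "scripted" or "conversational"
    accent: str  # e.g. "US", "British", "Indian"
    wer: float   # WER in percent for one private split


def summarize(results: list[PrivateResult]) -> dict[str, float]:
    """Return grouped averages only; individual split scores are never exposed."""
    def avg(rows):
        return mean(r.wer for r in rows)

    return {
        "average": avg(results),
        "scripted": avg([r for r in results if r.style == "scripted"]),
        "conversational": avg([r for r in results if r.style == "conversational"]),
        "us": avg([r for r in results if r.accent == "US"]),
        "non_us": avg([r for r in results if r.accent != "US"]),
    }


# Hypothetical split-level scores; these are not leaderboard results.
results = [
    PrivateResult("scripted", "US", 6.0),
    PrivateResult("scripted", "British", 8.0),
    PrivateResult("conversational", "US", 10.0),
    PrivateResult("conversational", "Indian", 16.0),
]

summary = summarize(results)
print(summary)
# {'average': 10.0, 'scripted': 7.0, 'conversational': 13.0, 'us': 8.0, 'non_us': 12.0}
```

Since each published number mixes several splits, a submitter cannot tell from the summary which individual split to tune against.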

This matters for AI because it addresses Goodhart's law: when a measure becomes a target, it ceases to be a good measure. Private datasets make the benchmark harder to game, which makes evaluations more trustworthy. For data science, it highlights the need for diverse, real-world test conditions and shows one way to design benchmarks that resist overfitting. Future updates may add noisy conditions and more data providers.


Source: Hugging Face blog, "Adding Benchmaxxer Repellant to the Open ASR Leaderboard"