today's digest covers efforts to make ai benchmarks more honest, a surprising finding about bias in reasoning models, and practical advances in model serving and algorithm design. we also look at new methods for efficient memory use in large language models and a mixture-of-experts approach that learns modular skills.
- adding benchmaxxer repellant to the open asr leaderboard. hugging face adds private speech datasets to its asr leaderboard to reduce benchmark gaming and provide a more realistic view of model performance across accents and speaking styles.
- more thinking, more bias: length-driven position bias in reasoning models. a study finds that longer chain-of-thought reasoning in ai models correlates with increased position bias in multiple-choice questions, challenging the assumption that more thinking reduces shallow biases (a sketch of how such bias can be probed follows this list).
- alphaevolve brings gemini into algorithm design. google deepmind's alphaevolve, a gemini-powered coding agent, is now optimizing algorithms across genomics, grid optimization, quantum physics, and commercial applications, showing broad real-world impact.
- ratequant tunes kv cache precision for llms. a new method allocates different bit-widths to attention heads during kv cache quantization, avoiding distortion-model mismatch and improving large language model serving efficiency (see the per-head quantization sketch after this list).
- emo: pretraining mixture of experts for emergent modularity. emo is a mixture-of-experts model trained so that experts self-organize into task-specific groups, allowing strong performance with only a small subset of experts active (see the routing sketch after this list).
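to make the position-bias finding concrete, here is a minimal sketch of one common way such bias is probed: present the same multiple-choice question with its options cyclically rotated and count which slot the model picks. the `ask_model` and `slot_preference` names are hypothetical placeholders for illustration; none of this code comes from the study itself.

```python
# minimal sketch of probing position bias in multiple-choice answering.
# `ask_model` is a hypothetical placeholder, not an api from the study.
from collections import Counter


def ask_model(question: str, options: list[str]) -> int:
    """placeholder: return the index (slot) of the option the model selects."""
    raise NotImplementedError("wire this up to your model of choice")


def slot_preference(question: str, options: list[str]) -> Counter:
    """count which slot gets picked as the options are cyclically rotated.

    each option occupies each slot exactly once across the rotations, so a
    model that chooses purely on content spreads its picks evenly over slots;
    a skew toward one slot indicates position bias.
    """
    counts = Counter()
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        counts[ask_model(question, rotated)] += 1
    return counts
```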
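for the kv cache story, the following is a minimal sketch of per-head mixed-precision quantization, assuming a cache laid out as [num_heads, seq_len, head_dim]. the bit-allocation heuristic shown (wider-range heads get more bits) and the function names `allocate_bits` / `quantize_per_head` are illustrative assumptions, not the method's actual criterion or api.

```python
# sketch of per-head mixed-precision kv cache quantization.
# layout assumption: kv is [num_heads, seq_len, head_dim].
# the allocation heuristic below is illustrative, not the method's own rule.
import torch


def allocate_bits(kv: torch.Tensor, base_bits: int = 4) -> list[int]:
    """toy heuristic: heads with a wider value range get one extra bit,
    the rest give one up, keeping the average near base_bits."""
    ranges = kv.amax(dim=(1, 2)) - kv.amin(dim=(1, 2))
    order = torch.argsort(ranges, descending=True).tolist()
    bits = [base_bits] * kv.shape[0]
    half = len(order) // 2
    for h in order[:half]:
        bits[h] += 1                   # sensitive heads: more precision
    for h in order[half:]:
        bits[h] = max(2, bits[h] - 1)  # robust heads: fewer bits
    return bits


def quantize_per_head(kv: torch.Tensor, bits_per_head: list[int]):
    """uniformly quantize each head to its own bit-width (per-head scale and offset)."""
    packed = []
    for h, bits in enumerate(bits_per_head):
        x = kv[h]
        qmax = (1 << bits) - 1
        lo, hi = x.min(), x.max()
        scale = (hi - lo).clamp(min=1e-8) / qmax
        q = ((x - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
        packed.append((q, scale, lo))
    return packed


def dequantize_per_head(packed) -> torch.Tensor:
    return torch.stack([q.float() * scale + lo for q, scale, lo in packed])
```

the appeal of mixing precisions is that some heads tolerate aggressive compression while others need headroom, so a per-head budget can cut cache memory further than a single uniform bit-width at comparable quality.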
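and for the mixture-of-experts story, here is a generic top-k routed moe layer sketched in pytorch, with an `active_experts` knob that restricts routing to a subset of experts at inference. the emergent grouping of experts is a training-time property of the work itself and is not reproduced here; the class name and the subset mechanism are illustrative assumptions showing what "running with a small subset of experts" looks like mechanically.

```python
# sketch of a generic top-k mixture-of-experts layer with an optional
# restriction to a subset of experts at inference. illustrative only; this is
# not the architecture or training procedure of the model described above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor, active_experts: list[int] | None = None):
        # x: [tokens, d_model]
        logits = self.router(x)
        if active_experts is not None:
            # mask out experts outside the chosen subset (e.g. one task group)
            mask = torch.full_like(logits, float("-inf"))
            mask[:, active_experts] = 0.0
            logits = logits + mask
        weights, idx = torch.topk(F.softmax(logits, dim=-1), self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique().tolist():
                rows = idx[:, slot] == e
                out[rows] += weights[rows, slot, None] * self.experts[e](x[rows])
        return out


if __name__ == "__main__":
    layer = TopKMoE(d_model=64, d_ff=256, num_experts=8, k=2)
    tokens = torch.randn(10, 64)
    full = layer(tokens)                                # route over all experts
    partial = layer(tokens, active_experts=[0, 1, 2])   # restrict to one group
    print(full.shape, partial.shape)
```

restricting `active_experts` to one group at inference is the mechanical payoff of modular experts: only a fraction of the parameters need to be loaded and executed for a given task.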
these stories highlight a push for more reliable evaluation, a deeper look at model reasoning flaws, and ongoing work to make ai systems more efficient and adaptable. from speech recognition to algorithm discovery, the field is refining both how we measure progress and how we build practical tools.