llm-guided autotuning speeds up pytorch kernel tuning

source: pytorch blog: from minutes to seconds: llm-guided autotuning for helion kernels

level: technical

helion, pytorch's domain-specific language for performance-portable machine learning kernels, relies on autotuning to find optimal configurations across large parameter spaces. the default tuner uses likelihood-free bayesian optimization (lfbo), which trains a random forest classifier during search to predict promising candidates. while effective, lfbo still requires hundreds of compile-and-benchmark cycles per kernel, slowing developer iteration and deployment.

a new llm-guided autotuner replaces this sample-heavy search with an iterative prompt-and-feedback loop. the llm receives the kernel source, input shapes, hardware details, and configuration space, then proposes candidate configs. after each benchmarking round, the best configs and observed failure patterns are fed back to the llm, and refinement rounds continue until gains stall. on 33 kernel instances across nvidia b200 gpus, the llm approach matches lfbo kernel performance (geomean 1.009x) while benchmarking about 10x fewer configs and reducing wall-clock time by 6.7x. the llm converges to its plateau within roughly the first 7% of lfbo's search budget.
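the propose, benchmark, and feed-back loop described above can be sketched roughly as follows. this is a minimal illustration under stated assumptions, not helion's implementation: `stub_proposer` is a hypothetical placeholder for the llm call, `toy_benchmark` is a synthetic stand-in for compile-and-benchmark, and the parameter names `block_size` and `num_warps`, the `top_k` feedback size, and the 1% stopping threshold are all invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]
Result = Tuple[Config, float]  # (config, measured latency in ms)


def toy_benchmark(cfg: Config) -> float:
    """Hypothetical stand-in for compile-and-benchmark; raises on invalid configs."""
    if cfg["block_size"] * cfg["num_warps"] > 2048:
        raise RuntimeError("out of resources")
    # Synthetic latency with its minimum at block_size=128, num_warps=4.
    return 1.0 + abs(cfg["block_size"] - 128) / 128 + abs(cfg["num_warps"] - 4) / 4


def stub_proposer(best: List[Result], failures: List[Config]) -> List[Config]:
    """Placeholder for the LLM call (assumption). A real proposer would build a
    prompt from kernel source, shapes, hardware, config space, and this feedback."""
    if not best:  # first round: no feedback yet, so return a coarse spread
        return [{"block_size": b, "num_warps": w} for b in (32, 512) for w in (1, 16)]
    proposals = []
    for cfg, _ in best:
        for key in cfg:
            for factor in (0.5, 2):  # halve or double one parameter
                cand = dict(cfg)
                cand[key] = max(1, int(cfg[key] * factor))
                if cand not in failures:
                    proposals.append(cand)
    return proposals


def tune(
    propose: Callable[[List[Result], List[Config]], List[Config]],
    benchmark: Callable[[Config], float],
    max_rounds: int = 10,
    top_k: int = 2,
    min_gain: float = 0.01,
) -> Tuple[Config, float, int]:
    """Run propose/benchmark rounds until the best time stops improving by min_gain."""
    results: Dict[tuple, float] = {}
    failures: List[Config] = []
    best_time = float("inf")
    for _ in range(max_rounds):
        ranked = sorted(results.items(), key=lambda kv: kv[1])[:top_k]
        feedback = [(dict(key), t) for key, t in ranked]
        new = 0
        for cfg in propose(feedback, failures):
            key = tuple(sorted(cfg.items()))
            if key in results or cfg in failures:
                continue  # never re-benchmark a known config
            new += 1
            try:
                results[key] = benchmark(cfg)
            except RuntimeError:
                failures.append(cfg)  # reported back in the next round
        round_best = min(results.values(), default=float("inf"))
        improved = round_best < best_time * (1 - min_gain)
        best_time = min(best_time, round_best)
        if new == 0 or not improved:
            break  # gains have stalled
    best_key = min(results, key=results.get)
    return dict(best_key), results[best_key], len(results) + len(failures)


best_cfg, best_ms, n_benchmarked = tune(stub_proposer, toy_benchmark)
```

the loop stops as soon as a round fails to beat the running best by the threshold, which mirrors the "until gains stall" criterion; how the actual autotuner detects a plateau is not specified in this summary.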

for the few cases where the llm trails lfbo by more than 5%, a hybrid strategy seeds lfbo with the llm's best configs, then runs lfbo refinement. this closes the gap in most cases while remaining about 3x cheaper than full lfbo. testing with claude opus-4.8, gpt-5.5, and claude sonnet-4.6 shows similar performance, suggesting the approach is model-agnostic. the practical recipe is to use llm-only search for rapid tuning, and to switch to hybrid search when maximum performance matters.
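the seeding idea behind the hybrid strategy can be illustrated with a small sketch. note the substitution: the refinement stage here is plain hill climbing, used only as a simple stand-in, not lfbo itself. `toy_benchmark`, the `block_size` and `num_warps` parameters, and the seed values are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def toy_benchmark(cfg: Config) -> float:
    """Synthetic latency (ms) with its minimum at block_size=128, num_warps=4."""
    return 1.0 + abs(cfg["block_size"] - 128) / 128 + abs(cfg["num_warps"] - 4) / 4


def neighbors(cfg: Config) -> List[Config]:
    """Configs reachable by halving or doubling exactly one parameter."""
    out = []
    for key, value in cfg.items():
        for new_value in (max(1, value // 2), value * 2):
            if new_value != value:
                out.append({**cfg, key: new_value})
    return out


def seeded_refine(
    seeds: List[Config],
    benchmark: Callable[[Config], float],
    budget: int,
) -> Tuple[Config, float, int]:
    """Refine from seed configs with hill climbing (a stand-in for LFBO refinement)."""
    if not seeds or budget < 1:
        raise ValueError("need at least one seed and a positive budget")
    measured: Dict[tuple, float] = {}

    def measure(cfg: Config) -> None:
        key = tuple(sorted(cfg.items()))
        if key not in measured and len(measured) < budget:
            measured[key] = benchmark(cfg)

    for cfg in seeds:  # start from the first stage's best configs
        measure(cfg)
    while len(measured) < budget:
        best_key = min(measured, key=measured.get)
        before = len(measured)
        for cand in neighbors(dict(best_key)):
            measure(cand)
        if len(measured) == before:
            break  # nothing new to try around the current best
        if min(measured, key=measured.get) == best_key:
            break  # no neighbor beat the current best: local optimum
    best_key = min(measured, key=measured.get)
    return dict(best_key), measured[best_key], len(measured)


# Hypothetical seeds, as if returned by an earlier LLM-only search.
llm_seeds = [{"block_size": 64, "num_warps": 2}, {"block_size": 32, "num_warps": 8}]
cfg, ms, used = seeded_refine(llm_seeds, toy_benchmark, budget=20)
```

starting from good seeds lets the second stage spend its budget near promising regions rather than exploring from scratch, which is one plausible reading of why the hybrid stays cheaper than full lfbo.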

why it matters: faster autotuning reduces compute cost and speeds up development of optimized ml kernels, making production deployment more efficient.
