helion kernels speed up vllm fp8 inference

source: pytorch blog: portable vllm model inference kernels in helion

level: technical

helion is a pytorch-native kernel language that uses a tile-programming model. it offers a familiar syntax for pytorch users while giving low-level control over memory and scheduling. its ahead-of-time autotuning explores many kernel configurations to pick the best one for a given workload and hardware. this work integrates helion kernels into vllm, a popular llm serving framework, to speed up fp8 quantized inference for qwen3 models.
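the autotuning idea above can be sketched in plain python. this is not helion's api or its search algorithm: `tiled_sum`, `time_config`, and `autotune` are hypothetical names, the only tunable is a made-up `block_size`, and the search is a simple exhaustive pass over a handful of candidates. it only illustrates the principle of measuring each configuration on the target machine and keeping the fastest.

```python
import time
from typing import Callable, Iterable


def tiled_sum(values: list[float], block_size: int) -> float:
    """Sum `values` one tile of `block_size` elements at a time."""
    total = 0.0
    for start in range(0, len(values), block_size):
        total += sum(values[start:start + block_size])
    return total


def time_config(values: list[float], block_size: int, repeats: int = 3) -> float:
    """Best-of-`repeats` wall-clock time for one candidate block size."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        tiled_sum(values, block_size)
        best = min(best, time.perf_counter() - t0)
    return best


def autotune(candidates: Iterable[int], measure: Callable[[int], float]) -> int:
    """Return the candidate with the lowest measured cost (exhaustive search)."""
    return min(candidates, key=measure)


# Usage: pick a block size for this machine and this input length.
data = [float(i) for i in range(10_000)]
best_block = autotune([16, 64, 256, 1024], lambda b: time_config(data, b))
print("fastest block size here:", best_block)
```

the winner depends on the machine and the input size, which is why tuning is done per workload and per hardware target rather than once globally.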

the team replaced most forward-pass kernels in vllm with helion versions, including fused quantization, normalization, and attention preprocessing. they tested on nvidia h100 and b200 gpus. for non-gemm kernels like rms norm plus quantization, helion consistently beat torch inductor and existing cuda kernels, with speedups up to 2.3x. for gemm kernels, helion matched or slightly exceeded cutlass on h100 but fell behind on b200 due to current triton code generation limits on blackwell gpus.
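to make "rms norm plus quantization" concrete, here is a reference sketch of the math in plain python, assuming dynamic per-token scaling into the float8 e4m3 range (largest finite value 448). the function names are hypothetical and this is not the helion kernel from the blog post. the final rounding to an actual fp8 dtype is left out, so the outputs are scaled and clamped floats. a fused gpu kernel would do both steps in one pass over each row instead of writing the normalized values to memory first.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def rms_norm(row: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """Scale `row` by the inverse of its root mean square, then by `weight`."""
    mean_sq = sum(v * v for v in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(row, weight)]


def quantize_per_token(row: list[float], qmax: float = FP8_E4M3_MAX):
    """Map one token's values into [-qmax, qmax] with a single dynamic scale.

    Returns the scaled values and the scale needed to dequantize them.
    Rounding to the FP8 grid is intentionally omitted in this sketch.
    """
    amax = max(abs(v) for v in row)
    scale = amax / qmax if amax > 0.0 else 1.0  # avoid dividing by zero
    scaled = [max(-qmax, min(qmax, v / scale)) for v in row]
    return scaled, scale


def rms_norm_quant(tokens: list[list[float]], weight: list[float], eps: float = 1e-6):
    """Apply RMS norm and per-token quantization row by row."""
    quantized, scales = [], []
    for row in tokens:
        q, s = quantize_per_token(rms_norm(row, weight, eps))
        quantized.append(q)
        scales.append(s)
    return quantized, scales
```

multiplying each output value by its row's scale recovers the normalized value, up to the rounding error a real fp8 cast would add.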

end-to-end benchmarks on qwen3 models showed modest throughput gains. with per-token quantization, the 1.7b model saw a 1.05x improvement, and the 8b model gained up to 1.09x in speculative decoding scenarios. the largest gains came from combining all helion kernel groups. the integration framework supports autotuning per model shape and dispatching at runtime based on batch size, which makes it practical for production use.
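runtime dispatch on batch size can be pictured as a lookup into a table of pre-tuned configurations. the sketch below is an assumption about how such a table might work, not vllm's or helion's actual code: `KernelConfig`, `BatchSizeDispatcher`, and the fields `block_size` and `num_warps` are hypothetical, and the bucketing rule (use the smallest tuned batch size that is at least the incoming one, falling back to the largest) is one reasonable choice among several.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelConfig:
    """Hypothetical tuned parameters for one kernel variant."""
    block_size: int
    num_warps: int


class BatchSizeDispatcher:
    """Pick a pre-tuned config from the batch size seen at runtime."""

    def __init__(self, tuned: dict[int, KernelConfig]):
        if not tuned:
            raise ValueError("at least one tuned config is required")
        self._buckets = sorted(tuned)  # tuned batch sizes, ascending
        self._configs = [tuned[b] for b in self._buckets]

    def select(self, batch_size: int) -> KernelConfig:
        # Index of the first tuned bucket >= batch_size.
        idx = bisect_left(self._buckets, batch_size)
        # Batches larger than every bucket reuse the largest tuned config.
        idx = min(idx, len(self._buckets) - 1)
        return self._configs[idx]


# Usage: configs found offline for three representative batch sizes.
dispatcher = BatchSizeDispatcher({
    1: KernelConfig(block_size=64, num_warps=4),
    16: KernelConfig(block_size=128, num_warps=4),
    256: KernelConfig(block_size=256, num_warps=8),
})
config = dispatcher.select(batch_size=20)  # falls into the 256 bucket
```

because tuning happens ahead of time, the per-request cost is a single binary search rather than a benchmark run.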

why it matters: faster llm inference reduces serving costs and latency, and helion's pytorch-native approach simplifies kernel development for ai engineers.
