source: pytorch blog: tokenspeed-kernel: portable apis and high-performance kernels for multi-silicon llm inference

level: technical

tokenspeed-kernel is a standalone, open-source subsystem that isolates backend complexity in llm inference. it provides a layered api and a registry that separate the high-level runtime from low-level, hardware-specific code. the runtime calls generic public apis for operations like attention and mixture of experts, while the kernel system selects the best implementation for the current platform. this design keeps model code portable and lets kernel development stay specialized, so hardware checks are not scattered through the runtime and backend logic does not leak into model code.

the kernel system uses a registry and selection mechanism. kernel authors register implementations with metadata describing operator family, platform capabilities, tensor signatures, and priority. at runtime, the selector filters candidates against the request and the platform, ranks the ones that match, and returns a callable. this supports multi-silicon targets, allows portable triton paths to sit alongside optimized kernels such as those written in gluon for amd, and enables fast iteration with numerics checks, standalone benchmarks, and profiling. out-of-tree plugins can also register kernels through the same decorator.
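
the register-then-select flow described above can be sketched in a few lines of python. this is a minimal illustration only, not the tokenspeed-kernel api: `register_kernel`, `select_kernel`, `KernelEntry`, the platform tags, and the dtype strings are all hypothetical names chosen for the example, and the metadata is reduced to operator family, platform, dtype, and priority.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical names throughout: this is NOT the tokenspeed-kernel API.


@dataclass(frozen=True)
class KernelEntry:
    op: str               # operator family, e.g. "attention"
    platforms: frozenset  # platform tags the kernel supports
    dtypes: frozenset     # input dtypes the kernel accepts
    priority: int         # higher wins among matching candidates
    fn: Callable


_REGISTRY: Dict[str, List[KernelEntry]] = {}


def register_kernel(op, platforms, dtypes, priority=0):
    """Decorator that records an implementation together with its metadata."""
    def wrap(fn):
        entry = KernelEntry(op, frozenset(platforms), frozenset(dtypes), priority, fn)
        _REGISTRY.setdefault(op, []).append(entry)
        return fn
    return wrap


def select_kernel(op, platform, dtype):
    """Filter candidates by platform and dtype, then return the top-priority callable."""
    candidates = [
        e for e in _REGISTRY.get(op, [])
        if platform in e.platforms and dtype in e.dtypes
    ]
    if not candidates:
        raise LookupError(f"no kernel for op={op!r} platform={platform!r} dtype={dtype!r}")
    return max(candidates, key=lambda e: e.priority).fn


@register_kernel("attention", platforms={"cuda", "rocm"}, dtypes={"bf16", "fp16"})
def attention_portable(q, k, v):
    return "portable"  # stand-in for a portable implementation


@register_kernel("attention", platforms={"rocm"}, dtypes={"bf16"}, priority=10)
def attention_tuned(q, k, v):
    return "tuned"  # stand-in for a platform-specific implementation


def attention(q, k, v, *, platform, dtype):
    """Public entry point: the caller never names a backend."""
    return select_kernel("attention", platform, dtype)(q, k, v)
```

in this sketch, `attention(..., platform="rocm", dtype="bf16")` dispatches to the higher-priority tuned entry, while a request the tuned entry does not cover (for example fp16 on the same platform) falls back to the portable one. a real selector would match on richer capability and tensor-signature metadata than two string tags.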

a practical example is gpt-oss 120b on amd mi355x. the model uses attention with sinks and sliding windows, and mixture of experts with mxfp4 weights and fp8 activations. tokenspeed-kernel keeps these details below the public api, so model code does not need to know hardware specifics. for amd, performance-critical kernels are implemented in gluon, a triton-family dsl that gives explicit control over cdna4 features like async copies, shared memory layouts, and matrix core operations. this allows deep optimization for decode-phase latency without complicating the runtime.
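
to make "attention with sinks and sliding windows" concrete, here is a small pure-python sketch of the attention weights for a single query. it rests on an assumption the summary does not spell out: the sink is modeled as one extra logit that takes part in the softmax normalization but contributes no value, so it only absorbs probability mass. the function name and parameters are hypothetical, and this is a reference-style illustration, not the kernel implementation.

```python
import math


def windowed_sink_weights(scores, query_pos, window, sink_logit):
    """Softmax over keys in a causal sliding window, plus one sink logit.

    scores[j] is the raw score of the query at `query_pos` against key j.
    Returns (weights over all keys, probability mass absorbed by the sink).
    Assumption: the sink is a single extra logit in the softmax denominator.
    """
    # Causal sliding window: only the last `window` keys up to query_pos are visible.
    start = max(0, query_pos - window + 1)
    visible = range(start, query_pos + 1)

    logits = [scores[j] for j in visible] + [sink_logit]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)

    weights = [0.0] * len(scores)
    for j, e in zip(visible, exps[:-1]):
        weights[j] = e / total
    return weights, exps[-1] / total


# Five keys with equal scores, a window of 2, and a sink logit of 0.
weights, sink_mass = windowed_sink_weights([0.0] * 5, query_pos=4, window=2, sink_logit=0.0)
```

with equal scores and a zero sink logit, the two visible keys and the sink each receive one third of the mass, and keys outside the window receive none. the key weights therefore sum to less than one, which is the effect of the sink under this assumption.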

why it matters: it simplifies building and maintaining fast llm inference across different hardware by providing a clean boundary between model code and platform-specific optimizations, reducing engineering overhead.

