source: pytorch blog: why is pytorch compile so fast: kernel fusion

level: technical

when pytorch runs a model without compilation (eager mode), each operation such as a multiply or an add launches a separate gpu kernel. this causes two slowdowns: the overhead of launching many kernels, and the cost of writing each intermediate result to global memory and reading it back. inductor, the compiler backend behind torch.compile, automatically fuses dependent operations into single triton kernels. this keeps intermediate values in fast registers and cuts the number of kernel launches.

vertical fusion is the most common pattern: it merges operations where each one consumes the previous one's output. for example, a sequence of multiply, add, and sigmoid normally uses three kernels with eight memory operations (five loads and three stores). after fusion, one kernel loads the three inputs once, performs all the math, and stores only the final result, for four memory operations in total. this eliminates two intermediate buffers and halves memory traffic. other vertical fusion types include reduction fusion for operations like batch normalization, and gemm epilogue fusion, which attaches the bias add and activation to a matrix multiply.
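the memory-operation count above can be checked with a small sketch. this is plain python bookkeeping, not inductor's actual cost model; the names CHAIN, unfused_traffic, and fused_traffic are made up for illustration, and each "memory operation" is assumed to mean one tensor-sized load or store.

```python
# each entry: (op name, number of tensor inputs); every op produces one output.
CHAIN = [("mul", 2), ("add", 2), ("sigmoid", 1)]


def unfused_traffic(chain):
    """one kernel per op: each op loads all its inputs and stores its output."""
    loads = sum(n_inputs for _, n_inputs in chain)
    stores = len(chain)
    return loads + stores


def fused_traffic(chain):
    """one kernel for the whole chain: intermediates stay in registers."""
    # the first op loads all its inputs; every later op takes one input from
    # a register (the previous result) and loads only the remaining ones.
    loads = chain[0][1] + sum(n_inputs - 1 for _, n_inputs in chain[1:])
    stores = 1  # only the final result goes back to global memory
    return loads + stores


print(unfused_traffic(CHAIN))  # 8
print(fused_traffic(CHAIN))    # 4
```

the two intermediate buffers disappear because the multiply and add results are never stored, which is where the drop from eight to four comes from.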

inductor also uses horizontal fusion to run independent operations on the same input together, like computing sine and cosine in one kernel so the input is loaded only once. users can see fusion in action by running a script that uses torch.compile with the TORCH_LOGS environment variable set (for example TORCH_LOGS="output_code") to print the generated triton kernels. the compiler works automatically: adding the torch.compile decorator is the only change needed to accelerate existing code.
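a minimal script for trying this out might look like the following. it is a sketch, not the blog's own example: the function names and the file name fusion_demo.py are hypothetical, and whether inductor actually fuses each function into one kernel depends on the pytorch version and device. on a machine without a gpu, inductor generates c++ code rather than triton kernels.

```python
import torch


def mul_add_sigmoid(x, y, z):
    # three dependent pointwise ops: a candidate for vertical fusion
    return torch.sigmoid(x * y + z)


def sin_and_cos(x):
    # two independent ops on the same input: a candidate for horizontal fusion
    return torch.sin(x), torch.cos(x)


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x, y, z = (torch.randn(1024, device=device) for _ in range(3))

    # same effect as decorating the functions with @torch.compile;
    # compilation happens lazily on the first call
    compiled_chain = torch.compile(mul_add_sigmoid)
    compiled_pair = torch.compile(sin_and_cos)

    # compiled results should match eager results within tolerance
    torch.testing.assert_close(compiled_chain(x, y, z), mul_add_sigmoid(x, y, z))
    s, c = compiled_pair(x)
    torch.testing.assert_close(s, torch.sin(x))
    torch.testing.assert_close(c, torch.cos(x))


if __name__ == "__main__":
    main()
```

saved as fusion_demo.py, the script can be run as `TORCH_LOGS="output_code" python fusion_demo.py` to print the code inductor generates for each compiled function.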

why it matters: kernel fusion cuts global-memory traffic and kernel launch overhead, directly speeding up model training and inference with no code changes beyond the torch.compile call.

