source: hugging face blog: profiling in pytorch (part 2): from nn.linear to a fused mlp
level: technical
nn.linear wraps a matrix multiply and a bias addition, but it does not launch a separate kernel for each. the bias is folded into the gemm kernel as an epilogue, a small computation applied to the gemm output just before results are written to gpu memory. this avoids the extra read and write of the output that a standalone bias kernel would need. the cpu dispatch chain includes an aten::t transpose, but it only rewrites tensor metadata (shape and strides) without copying data or launching a gpu kernel.
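a minimal sketch of the two claims above, run on cpu so it works without a gpu. it checks that the transposed weight is a view over the same storage with swapped strides, and that a single torch.addmm call reproduces the layer output. the layer sizes (8 in, 4 out) and batch size are arbitrary choices for illustration, not values from the blog post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 4)   # weight has shape (4, 8), bias has shape (4,)
x = torch.randn(3, 8)     # a batch of 3 input rows

# .t() returns a view: same storage, only shape and strides change
w_t = layer.weight.t()
same_storage = w_t.data_ptr() == layer.weight.data_ptr()
print(layer.weight.stride(), w_t.stride(), same_storage)

# addmm computes bias + x @ w_t in one call, which is the op
# that the linear layer dispatches to for 2-d inputs with a bias
out_linear = layer(x)
out_addmm = torch.addmm(layer.bias, x, w_t)
print(torch.allclose(out_linear, out_addmm, atol=1e-5))
```

the strides go from (8, 1) to (1, 8) while the data pointer stays the same, which is why the transpose costs no gpu work.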
stacking three linear layers with a gated gelu (geglu) activation forms an mlp. profiling shows five gpu kernels per forward pass: three gemms and two pointwise kernels, one for gelu and one for the elementwise multiply. cublas selects a tile size for each gemm based on the problem shape, and that choice affects performance. it picks a 128x128 tile for the first two projections and a 128x256 tile for the last, which is faster due to better data reuse.
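the mlp described above could be written and profiled roughly as follows. this is a sketch under assumptions: the class name GeGLUMLP, the layer names gate_proj, up_proj and down_proj, and the sizes (256 hidden, 1024 intermediate, batch of 32) are hypothetical and not taken from the blog post. it falls back to cpu when no gpu is present, in which case the table shows only cpu-side ops.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity


class GeGLUMLP(nn.Module):
    """Three linear layers joined by a gated GELU (hypothetical names)."""

    def __init__(self, hidden, intermediate):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate)
        self.up_proj = nn.Linear(hidden, intermediate)
        self.down_proj = nn.Linear(intermediate, hidden)

    def forward(self, x):
        # two gemms, then gelu and multiply (pointwise), then a third gemm
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))


device = "cuda" if torch.cuda.is_available() else "cpu"
model = GeGLUMLP(256, 1024).to(device)
x = torch.randn(32, 256, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad():
    model(x)  # warm-up so one-time setup cost is not profiled
    with profile(activities=activities) as prof:
        out = model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for queued kernels to finish

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))
```

on a gpu, the kernel rows in this table are where the three gemms and two pointwise kernels would show up. the exact kernel names and tile sizes depend on the hardware, the library versions, and the tensor shapes.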
torch.compile on a single linear layer does not fuse anything because eager mode already uses a fused addmm kernel. for the mlp, compile reduces cpu overhead by precomputing tensor strides and dispatching addmm directly, skipping the transpose view. the gemm kernels remain identical, but cpu launch overhead drops. compile also fuses the pointwise gelu and multiply into a single kernel, reducing total kernel launches from five to four (three gemms plus one fused pointwise kernel).
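a small sketch of the pointwise fusion only, assuming the gelu and multiply are isolated in a helper function (geglu_pointwise is a hypothetical name). it does not cover the cpu-side changes to addmm dispatch. the compiled path is exercised only when a gpu is available, since kernel fusion is a gpu-side effect and compilation needs a working compiler toolchain.

```python
import torch
import torch.nn.functional as F


def geglu_pointwise(gate, up):
    # in eager mode, gelu and the multiply run as two separate kernels
    return F.gelu(gate) * up


# torch.compile is lazy: code generation happens on the first call
compiled_pointwise = torch.compile(geglu_pointwise)


def check_equivalence(device):
    """Compare eager and compiled results on random inputs."""
    gate = torch.randn(64, 256, device=device)
    up = torch.randn(64, 256, device=device)
    eager = geglu_pointwise(gate, up)
    fused = compiled_pointwise(gate, up)
    return torch.allclose(eager, fused, atol=1e-4)


if torch.cuda.is_available():
    print("eager and compiled agree:", check_equivalence("cuda"))
```

profiling compiled_pointwise with torch.profiler on a gpu is one way to confirm that the two pointwise kernels have been replaced by a single generated kernel.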
why it matters: understanding kernel fusion and cpu overhead helps optimize deep learning models by reducing memory traffic and launch latency, directly improving training and inference speed.