level: technical
tlx block attention is a triton kernel designed for nvidia blackwell gpus that targets block-diagonal self-attention, where sequences are split into fixed-size groups that only attend within themselves. this pattern is common in recommendation and feature-interaction models. the kernel uses compile-time knowledge of the attention pattern to eliminate overhead found in general-purpose attention implementations like flash attention v2. on nvidia b200 gpus, it achieves a 1.85x forward speedup and 2.50x backward speedup over flash attention v2, with a 3.5x speedup for the combined attention and rotary backward pass when rotary embeddings are fused.
the key insight is that with a fixed block size, each query tile attends to exactly one key-value tile, collapsing multi-iteration accumulators into single matrix multiplications. this removes the need for online softmax correction, since every score row is complete after a single tile, and also for logsumexp storage and auxiliary kernel launches. the backward pass recomputes attention probabilities inline, avoiding a separate preprocessing step. the kernel uses warp specialization, assigning different warps to load, compute, and store tasks, and employs triple-buffered shared memory and double-buffered tensor memory to keep hardware units busy. it is built with tlx, a set of triton language extensions that expose low-level blackwell features like asynchronous tensor core operations and memory management.
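the single-tile claim above can be checked numerically with a small reference model. the sketch below is plain python, not the tlx kernel: it compares full attention with a block-diagonal mask against a version that handles each block with one score tile and one ordinary softmax. the block size `BLOCK`, the head dimension of 8, and all function names are assumptions chosen for illustration.

```python
import math
import random

BLOCK = 4  # assumed group size for illustration; not the kernel's real tile size


def softmax(scores):
    """Ordinary one-pass softmax over a complete row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def masked_full_attention(q, k, v):
    """Reference: score every key, then mask keys that belong to other blocks."""
    scale = 1.0 / math.sqrt(len(q[0]))
    v_cols = list(zip(*v))
    out = []
    for i, qi in enumerate(q):
        scores = [
            scale * dot(qi, kj) if i // BLOCK == j // BLOCK else -math.inf
            for j, kj in enumerate(k)
        ]
        probs = softmax(scores)
        out.append([dot(probs, col) for col in v_cols])
    return out


def block_diagonal_attention(q, k, v):
    """Per block: one score tile, one softmax, one P @ V product. No rescaling."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for start in range(0, len(q), BLOCK):
        k_tile = k[start:start + BLOCK]
        v_cols = list(zip(*v[start:start + BLOCK]))
        for qi in q[start:start + BLOCK]:
            probs = softmax([scale * dot(qi, kj) for kj in k_tile])
            out.append([dot(probs, col) for col in v_cols])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
q, k, v = (random_matrix(3 * BLOCK, 8, rng) for _ in range(3))
reference = masked_full_attention(q, k, v)
blockwise = block_diagonal_attention(q, k, v)
max_diff = max(
    abs(a - b) for ra, rb in zip(reference, blockwise) for a, b in zip(ra, rb)
)
print(f"max abs difference: {max_diff:.2e}")
```

the block-wise version never carries a running maximum or running sum between tiles, which is the bookkeeping that online softmax needs when a query row spans several key-value tiles.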
the kernel handles variable-length sequences by launching persistent thread blocks per streaming multiprocessor and balancing work across them using precomputed tile ranges. because the workload is memory-bandwidth bound, optimizations focus on hiding latency and reducing memory traffic. the code is available on github, and the approach shows how compile-time pattern knowledge can dramatically simplify attention algorithms for specialized use cases.
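the summary does not say how the tile ranges are computed, so the following is only one plausible scheme, sketched in plain python under stated assumptions: each sequence is cut into tiles of a hypothetical size `TILE`, the tiles are numbered globally, and each persistent worker receives a contiguous range whose size differs from the others by at most one tile. `TILE`, `NUM_WORKERS`, and every function name here are illustrative, not taken from the kernel.

```python
from bisect import bisect_right
from itertools import accumulate

TILE = 64        # hypothetical tile size in tokens
NUM_WORKERS = 4  # hypothetical number of persistent thread blocks


def tile_offsets(seq_lens, tile):
    """Prefix sums of per-sequence tile counts: offsets[i] is sequence i's first tile."""
    counts = [-(-n // tile) for n in seq_lens]  # ceiling division
    return [0] + list(accumulate(counts))


def precompute_tile_ranges(total_tiles, num_workers):
    """Split [0, total_tiles) into contiguous ranges that differ by at most one tile."""
    base, extra = divmod(total_tiles, num_workers)
    ranges, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def locate_tile(tile_id, offsets):
    """Map a global tile id back to (sequence index, tile index within the sequence)."""
    seq = bisect_right(offsets, tile_id) - 1
    return seq, tile_id - offsets[seq]


seq_lens = [100, 37, 512, 64, 9]  # made-up variable-length batch
offsets = tile_offsets(seq_lens, TILE)
ranges = precompute_tile_ranges(offsets[-1], NUM_WORKERS)

for worker, (lo, hi) in enumerate(ranges):
    work = [locate_tile(t, offsets) for t in range(lo, hi)]
    print(f"worker {worker}: tiles [{lo}, {hi}) -> {work}")
```

computing the ranges ahead of time means each worker only needs its own `(start, end)` pair and the offsets table to find its work, so a long sequence is shared among workers instead of stalling one of them.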
why it matters: this kernel can significantly reduce inference and training costs for models that use block-diagonal attention, such as recommendation systems, by cutting computation time and memory traffic.