source: pytorch blog: in-kernel broadcast optimization: co-designing kernels for recsys inference
level: technical
recommendation systems score many candidate items per user request. user features are identical across candidates, but standard inference replicates them to match candidate batch sizes before interaction layers. this replication wastes memory bandwidth and compute, scaling with candidate count. in-kernel broadcast optimization (ikbo) treats broadcast as a data layout issue, not a compute requirement. kernels accept user and candidate inputs at their natural, mismatched batch sizes and handle broadcast internally, so no replicated tensors are ever materialized.
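the idea above can be sketched in a few lines of numpy. this is an illustrative model of the concept, not meta's kernel api: the "kernel loop" stands in for the gpu grid, and `cand_to_user` is the candidate-to-user mapping the text describes.

```python
import numpy as np

def interaction_materialized(user, cand, cand_to_user):
    # standard path: replicate user rows to match the candidate batch,
    # then run a dense elementwise interaction (hadamard product here).
    user_rep = user[cand_to_user]          # [num_cand, d] materialized copy
    return user_rep * cand

def interaction_ikbo_style(user, cand, cand_to_user):
    # ikbo-style path: no replicated tensor is ever materialized; each
    # "kernel iteration" indexes the shared user row directly.
    out = np.empty_like(cand)
    for i, u in enumerate(cand_to_user):   # in a real kernel this is the launch grid
        out[i] = user[u] * cand[i]         # broadcast resolved in-kernel
    return out

rng = np.random.default_rng(0)
user = rng.standard_normal((2, 4))              # 2 user requests, feature dim 4
cand = rng.standard_normal((6, 4))              # 6 candidates total
cand_to_user = np.array([0, 0, 0, 1, 1, 1])     # first 3 candidates -> user 0, rest -> user 1

assert np.allclose(interaction_materialized(user, cand, cand_to_user),
                   interaction_ikbo_style(user, cand, cand_to_user))
```

both paths produce identical results; the difference is purely where the broadcast happens, which is what makes it a data layout issue rather than a compute requirement.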
ikbo is deployed across meta's multi-stage recommendation funnel on gpu and mtia accelerators. it serves as the scalability backbone for the meta adaptive ranking model. two kernel examples show the approach: linear compression and flash attention. for linear compression, four progressive co-design stages—matmul decomposition, memory alignment, broadcast fusion, and warp-specialized fusion—yield a cumulative ~4× speedup on h100 sxm5. the flash attention kernel shifts from io-bound to compute-bound, reaching 621 bf16 tflops and delivering 2.4×/6.4× throughput gains over a non-co-designed baseline.
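the matmul decomposition stage can be illustrated with a small numpy sketch. this is a hypothetical reconstruction of the algebra, not the blog's kernel code: a linear layer over concatenated [user, candidate] features splits into two matmuls, so the user half runs once per user instead of once per candidate.

```python
import numpy as np

rng = np.random.default_rng(0)
d_u, d_c, d_out = 4, 3, 5
num_users, num_cand = 2, 6
cand_to_user = np.array([0, 0, 0, 1, 1, 1])

user = rng.standard_normal((num_users, d_u))
cand = rng.standard_normal((num_cand, d_c))
W = rng.standard_normal((d_u + d_c, d_out))

# baseline: replicate user rows, concatenate, one big matmul at candidate batch size
x = np.concatenate([user[cand_to_user], cand], axis=1)   # [num_cand, d_u + d_c]
y_baseline = x @ W

# decomposed: split W row-wise; the user matmul runs at user batch size,
# and its result is gathered per candidate via the mapping
W_u, W_c = W[:d_u], W[d_u:]
y_user = user @ W_u                                      # [num_users, d_out], computed once
y_decomposed = y_user[cand_to_user] + cand @ W_c

assert np.allclose(y_baseline, y_decomposed)
```

the decomposition is exact because [u, c] @ [[W_u], [W_c]] = u @ W_u + c @ W_c; the savings scale with the candidate-to-user ratio, since the user-side matmul shrinks from candidate batch size to user batch size.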
the system design spans kernels, compilation, and runtime. kernels handle mismatched batch sizes internally. the compiler resolves dynamic shapes for operators with multiple batch dimensions. the runtime passes candidate-to-user mappings instead of materializing broadcasts. ikbo can be adopted directly in model code or applied via inference-time transformations that swap standard ops for ikbo equivalents. unlike system-level broadcast or net-splitting, ikbo eliminates replication at the computational primitive layer, achieving dense interaction quality at near-independent cost.
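the inference-time transformation path can be sketched as an op swap. the class names and swap table below are illustrative assumptions, not meta's api; the point is that the ikbo equivalent changes the op's signature to accept natural batch sizes plus a mapping.

```python
import numpy as np

class Linear:
    # standard op: expects inputs already replicated to the candidate batch
    def __init__(self, w):
        self.w = w
    def __call__(self, x):
        return x @ self.w

class IkboLinear:
    # ikbo equivalent: takes user features at their natural batch size plus
    # the candidate-to-user mapping; broadcast is resolved inside the op
    def __init__(self, w):
        self.w = w
    def __call__(self, user, cand_to_user):
        return (user @ self.w)[cand_to_user]   # compute per-user, gather per-candidate

SWAP = {Linear: IkboLinear}

def apply_ikbo_transform(ops):
    # replace each standard op with its ikbo equivalent, reusing the weights
    return [SWAP[type(op)](op.w) if type(op) in SWAP else op for op in ops]

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 2))
user = rng.standard_normal((2, 4))
cand_to_user = np.array([0, 0, 1])

std = Linear(w)
ikbo = apply_ikbo_transform([std])[0]
assert np.allclose(std(user[cand_to_user]), ikbo(user, cand_to_user))
```

because the swap preserves numerics exactly, it can be applied as a pure inference-time rewrite without retraining, which is consistent with the text's claim of dense interaction quality at near-independent cost.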
why it matters: reducing redundant data movement and compute in recommendation inference directly lowers latency and infrastructure cost for large-scale ai services.