deepseek-v4 serving throughput jumps 5x on gb300 with sglang

source: pytorch blog: serving deepseek-v4 on gb300 with sglang: 5x higher throughput at the same interactivity since day-0

level: technical

the public semianalysis inferencex dashboard shows sglang's deepseek-v4 throughput on nvidia gb300 disaggregated lanes rising from about 2,200 tok/s/gpu at day-0 to roughly 11,200 tok/s/gpu by june 2026, about a 5x increase at the same user-visible interactivity. both the no-mtp and mtp curves lifted across the entire interactivity range, and the june curves sustain much higher throughput in the high-interactivity region most deployments target. on blackwell ultra aggregated lanes, throughput improved by 2.91x at 30 tok/s/user (no-mtp) and by 2.85x at 90 tok/s/user (mtp); no-mtp peak throughput rose more than 6x, attributed to better recipe dispatch and higher sustainable batch sizes.
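to make the headline ratio concrete, the short sketch below recomputes the speedup from the two per-gpu figures quoted above and translates it into gpu counts. the 1,000,000 tok/s aggregate target is a hypothetical load chosen only for illustration; it does not come from the source.

```python
import math

# per-gpu throughput figures quoted above for gb300 disaggregated lanes (tok/s/gpu)
DAY0_TOK_PER_GPU = 2_200
JUNE_TOK_PER_GPU = 11_200

# hypothetical aggregate load, assumed purely for illustration
TARGET_TOK_PER_S = 1_000_000


def speedup(before: float, after: float) -> float:
    """ratio of new throughput to old throughput."""
    return after / before


def gpus_needed(target_tok_per_s: float, tok_per_gpu: float) -> int:
    """smallest whole number of gpus that meets the target throughput."""
    return math.ceil(target_tok_per_s / tok_per_gpu)


ratio = speedup(DAY0_TOK_PER_GPU, JUNE_TOK_PER_GPU)
gpus_day0 = gpus_needed(TARGET_TOK_PER_S, DAY0_TOK_PER_GPU)
gpus_june = gpus_needed(TARGET_TOK_PER_S, JUNE_TOK_PER_GPU)

print(f"speedup: {ratio:.2f}x")          # speedup: 5.09x
print(f"gpus at day-0: {gpus_day0}")     # gpus at day-0: 455
print(f"gpus in june:  {gpus_june}")     # gpus in june:  90
```

under that assumed load, the same traffic at the same interactivity would need roughly one fifth of the gpus, which is where the cost argument comes from.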

kernel optimizations included fusing operations in the mhc pipeline to reduce intermediate tensor traffic, adding kv compression v2 with new c4, c128, and online c128 kernels, and enabling a w4a4 megamoe path that quantizes activations to mxfp4 instead of mxfp8 with negligible accuracy loss. runtime improvements focused on better sliding window attention budgeting and eviction, more accurate preallocation sizing for disaggregated decode, per-concurrency recipe dispatch, and breakable cuda graph support for the prefill path. these changes let decode workers run at higher effective batch sizes and kept gpus busier during prefill.
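as a rough illustration of what quantizing activations to mxfp4 involves, the sketch below follows the general microscaling idea: blocks of 32 values share one power-of-two scale, and each value is stored as a 4-bit e2m1 float. this is a pure-python teaching sketch, not the sglang megamoe kernel; the function names are hypothetical, and ties are broken toward the smaller magnitude rather than by round-to-nearest-even.

```python
import math

# magnitudes representable by a 4-bit e2m1 float (sign is stored separately)
FP4_E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements sharing one scale in the mx formats
E2M1_EMAX = 2     # largest element exponent: 6.0 == 1.5 * 2**2


def quantize_block(block):
    """return (shared_exponent, quantized_values) for one block of floats."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) == e - 1
    floor_log2 = math.frexp(max_abs)[1] - 1
    shared_exp = floor_log2 - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for x in block:
        scaled = abs(x) / scale
        # nearest representable magnitude; values above 6.0 saturate to 6.0
        nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - scaled))
        quantized.append(math.copysign(nearest, x))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """reconstruct approximate floats from the shared exponent and fp4 values."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


# usage: one block of 32 activations with a large outlier
activations = [12.0, -3.0, 0.9, 0.1] + [1.0] * (BLOCK_SIZE - 4)
exp, q = quantize_block(activations)
restored = dequantize_block(exp, q)
worst_error = max(abs(a - r) for a, r in zip(activations, restored))
print(exp, restored[:4], worst_error)
```

the outlier sets the shared scale, so small values in the same block lose precision; mxfp8 keeps the same block-scaling scheme but spends 8 bits per element, which is the trade-off the w4a4 path gives up in exchange for less memory traffic.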

bug fixes and hardening also contributed significantly. fixes corrected metadata buffer sizing and double-free errors in swa memory, and a token-bucket prewarm addressed lazy compilation costs. fixing a nan issue on blackwell raised mtp acceptance rates from 0.57 to 0.70. a dynamo fix aligned bootstrap-room generation with the prefill dp rank, reducing workload imbalance. together, these corrections removed instability and allowed the serving frontier to hold its improved performance curves across real concurrency sweeps.
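to see why the acceptance-rate change matters, the sketch below uses the standard speculative-decoding estimate of tokens emitted per decode step. it rests on assumptions the source does not state: that the quoted rate is a per-token acceptance probability, that acceptances are independent, that drafting stops at the first rejection, and that one draft token is proposed per step.

```python
def expected_tokens_per_step(acceptance: float, draft_tokens: int = 1) -> float:
    """expected tokens emitted per decode step.

    the target model always contributes one token; draft token i is kept only
    if all drafts before it were accepted, which happens with probability
    acceptance**i under the independence assumption.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be a probability in [0, 1]")
    return sum(acceptance ** i for i in range(draft_tokens + 1))


before = expected_tokens_per_step(0.57)  # acceptance rate before the nan fix
after = expected_tokens_per_step(0.70)   # acceptance rate after the nan fix
gain = after / before - 1.0

print(f"tokens/step before: {before:.2f}")   # tokens/step before: 1.57
print(f"tokens/step after:  {after:.2f}")    # tokens/step after:  1.70
print(f"relative gain:      {gain:.1%}")     # relative gain:      8.3%
```

under these assumptions the fix alone is worth roughly 8% more tokens per mtp decode step; with more draft tokens per step the gap would widen, since the higher rate compounds across positions.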

why it matters: these optimizations make large-scale deployment of deepseek-v4 more cost-effective and reliable, enabling higher throughput without sacrificing user experience.
