source: hugging face blog: unlocking asynchronicity in continuous batching
level: technical
continuous batching improves gpu utilization by packing requests tightly, but it still leaves gaps because the cpu and gpu take turns: while the gpu computes, the cpu waits, and while the cpu prepares the next batch, the gpu sits idle. profiling shows that for an 8b model generating 8k tokens with batch size 32, nearly a quarter of the total runtime passes with the gpu idle. eliminating that overhead alone would cut generation time from 300 to 228 seconds, without changing the model or kernels.
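as a rough illustration, here is a minimal sketch of how such an idle fraction could be measured with cuda timing events; the stand-in linear model, tensor shapes, and the sleep that mimics cpu-side batch preparation are all assumptions for the demo, not the blog's actual profiling setup.

```python
import time
import torch

model = torch.nn.Linear(4096, 4096).cuda()               # stand-in for the real decode step
batches = [torch.randn(32, 4096, device="cuda") for _ in range(100)]

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
gpu_ms = 0.0
wall_start = time.perf_counter()
for batch in batches:
    time.sleep(0.001)                # stand-in for cpu-side batch preparation
    start.record()                   # mark start of gpu work on the current stream
    model(batch)
    end.record()
    end.synchronize()                # wait so elapsed_time below is valid
    gpu_ms += start.elapsed_time(end)  # milliseconds the gpu spent computing
wall_ms = (time.perf_counter() - wall_start) * 1000
print(f"gpu idle fraction: {1 - gpu_ms / wall_ms:.0%}")
```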
the fix is to run cpu batch preparation and gpu computation at the same time using cuda streams. operations on different streams can overlap, but the default stream forces synchronization, so we use three non-default streams: one for host-to-device transfers, one for compute, and one for device-to-host transfers. cuda events enforce ordering between streams: a stream records an event when a task finishes, and another stream waits for that event before starting. this lets the cpu enqueue all gpu work for batch n and immediately start preparing batch n+1.
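a minimal pytorch sketch of this three-stream setup, assuming a stand-in linear model and pinned host memory (the shapes and names are illustrative, not from the blog):

```python
import torch

# three non-default streams, one per pipeline stage
h2d = torch.cuda.Stream()      # host-to-device input copies
compute = torch.cuda.Stream()  # forward passes
d2h = torch.cuda.Stream()      # device-to-host output copies

model = torch.nn.Linear(4096, 4096).cuda()       # stand-in model
cpu_batch = torch.randn(32, 4096).pin_memory()   # pinned memory enables truly async copies

copied, computed = torch.cuda.Event(), torch.cuda.Event()

with torch.cuda.stream(h2d):
    gpu_batch = cpu_batch.to("cuda", non_blocking=True)
    copied.record(h2d)                 # signal: input is on the device

with torch.cuda.stream(compute):
    compute.wait_event(copied)         # don't start before the copy finishes
    logits = model(gpu_batch)
    computed.record(compute)           # signal: output is ready

with torch.cuda.stream(d2h):
    d2h.wait_event(computed)           # don't copy back before compute finishes
    out = torch.empty(logits.shape, dtype=logits.dtype, pin_memory=True)
    out.copy_(logits, non_blocking=True)

# every call above only enqueues work and returns immediately,
# so the cpu is now free to prepare batch n+1
```

note that pinned (page-locked) host memory matters here: without it, `non_blocking=True` copies silently fall back to synchronous ones and the overlap disappears.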
the main challenge is avoiding data corruption: the cpu must not prepare batch n+1 in buffers the gpu is still reading for batch n. the solution is to double-buffer inputs and outputs, so the cpu writes to one set of buffers while the gpu reads from the other, and event-based synchronization ensures the cpu never overwrites data still in use. this keeps the gpu busy almost all the time, turning idle gaps into productive work and significantly increasing throughput for llm serving.
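a sketch of the double-buffered input loop under the same assumptions as above; `bufs[i].normal_()` stands in for real cpu-side batch preparation, and the model and shapes are again hypothetical:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()             # stand-in model
h2d, compute = torch.cuda.Stream(), torch.cuda.Stream()

# two pinned host buffers: the cpu writes one while the gpu reads the other
bufs = [torch.empty(32, 4096).pin_memory() for _ in range(2)]
copied = [torch.cuda.Event() for _ in range(2)]        # "h2d copy of buffer i is done"

for step in range(100):
    i = step % 2
    copied[i].synchronize()       # block the cpu only if buffer i's last copy is still in flight
    bufs[i].normal_()             # stand-in for cpu-side preparation of the next batch
    with torch.cuda.stream(h2d):
        gpu_in = bufs[i].to("cuda", non_blocking=True)
        copied[i].record(h2d)     # once this event passes, buffer i may be overwritten
    with torch.cuda.stream(compute):
        compute.wait_event(copied[i])   # don't read gpu_in before the copy lands
        logits = model(gpu_in)
```

a production version would also need allocator care when a tensor allocated on one stream is consumed on another (e.g. `gpu_in.record_stream(compute)`), but the sketch shows the core idea: the event gates both when the gpu may read the device copy and when the cpu may reuse the host buffer.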
why it matters: higher gpu utilization means serving more requests per dollar, lowering costs for ai inference services.