tokenspeed hits 580 tps on qwen3.5-397b-a17b for agentic workloads

source: pytorch blog: up to 580tps! new speed record of qwen3.5-397b-a17b on gpu for agentic workloads with tokenspeed

level: technical

tokenspeed, an open-source llm inference engine from the lightseek foundation, reports a new speed record of up to 580 tokens per second running the qwen3.5-397b-a17b model on gpus. the engine targets agentic workloads, where multi-step tool calls and long contexts are common. it achieves high throughput by removing unnecessary memory copies, fusing kernels, and overlapping cpu and gpu execution so the gpu does not sit idle waiting on the host.

the system handles the hybrid attention architecture of qwen3.5, which mixes standard full attention layers with gated delta network linear attention layers. tokenspeed manages two separate resource pools: kv cache for full attention and mamba state for linear attention. it supports hybrid prefix caching, allowing reuse of both kv pages and recurrent states across requests. a copy-on-write mechanism keeps cached states clean: a request that would modify a shared state writes to its own copy instead. chunked prefill with overlap scheduling hides latency.
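
the copy-on-write idea can be illustrated with a small reference-counted pool. this is a minimal sketch, not tokenspeed's code: the `StatePool` class, its method names, and the use of plain python lists as recurrent states are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StatePool:
    """Toy pool of recurrent-state slots with reference counts (hypothetical)."""
    slots: dict = field(default_factory=dict)     # slot id -> state values
    refcount: dict = field(default_factory=dict)  # slot id -> number of holders
    next_id: int = 0

    def alloc(self, state):
        """Store a fresh copy of `state` in a new slot held by one owner."""
        sid = self.next_id
        self.next_id += 1
        self.slots[sid] = list(state)
        self.refcount[sid] = 1
        return sid

    def share(self, sid):
        """Register another holder of an existing slot without copying it."""
        self.refcount[sid] += 1
        return sid

    def writable(self, sid):
        """Return a slot the caller may mutate, copying first if it is shared."""
        if self.refcount[sid] == 1:
            return sid
        self.refcount[sid] -= 1
        return self.alloc(self.slots[sid])


pool = StatePool()
cached = pool.alloc([0.0, 0.0])  # state kept by the prefix cache
req = pool.share(cached)         # a new request reuses the cached state
req = pool.writable(req)         # the first write triggers a private copy
pool.slots[req][0] += 1.0        # in-place update lands on the copy only
```

the cached slot stays untouched after the request's update, so a later request with the same prefix can still reuse it. a kv page pool could follow the same pattern, although kv pages are append-only and usually need copying less often.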

for prefill-decode disaggregation, tokenspeed transfers mamba states alongside kv caches using the same rdma machinery. a unified step counter tracks progress across all layer types, enabling layerwise transfer that overlaps communication with computation. a three-phase handshake ensures the decode node receives all states and the first token before generation begins. additional optimizations avoid state copies during speculative decoding by using index indirection instead of data movement.
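
index indirection during speculative decoding can be sketched as follows. this is an illustrative toy under stated assumptions, not the engine's implementation: the `SpecStates` class, its `draft` and `commit` methods, and the integer "state" with an additive `step` function are all hypothetical stand-ins for real recurrent-state tensors and kernels.

```python
class SpecStates:
    """Toy per-request state table; `current` points at the committed state."""

    def __init__(self, num_slots, init_state):
        self.slots = [None] * num_slots
        self.slots[0] = init_state
        self.current = 0  # index of the slot holding the committed state

    def draft(self, tokens, step):
        """Write the state after each draft token into a spare slot.

        Returns the chain of slot indices: chain[n] holds the state after
        n draft tokens, with chain[0] being the committed slot.
        """
        free = [i for i in range(len(self.slots)) if i != self.current]
        assert len(tokens) <= len(free), "not enough spare slots"
        chain = [self.current]
        state = self.slots[self.current]
        for tok, slot in zip(tokens, free):
            state = step(state, tok)
            self.slots[slot] = state
            chain.append(slot)
        return chain

    def commit(self, chain, num_accepted):
        """Accept the first `num_accepted` draft tokens by moving the index."""
        self.current = chain[num_accepted]


def step(state, tok):
    # Stand-in for a recurrent update; a real layer would update a tensor.
    return state + tok


states = SpecStates(num_slots=4, init_state=0)
chain = states.draft([1, 2, 3], step)  # three draft tokens, three spare slots
states.commit(chain, num_accepted=2)   # verifier accepted two of them
committed = states.slots[states.current]
```

accepting two of three draft tokens only changes `current`; no state is copied back into a canonical slot, and the rejected slot is simply overwritten by the next draft round.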

why it matters: faster inference for large hybrid models enables more responsive ai agents and reduces cost per query in production systems.
