nemotron-labs diffusion speeds text generation

source: hugging face blog: towards speed-of-light text generation with nemotron-labs diffusion language models

level: technical

nvidia introduced nemotron-labs diffusion, a family of diffusion language models that generate text by producing multiple tokens at once and then refining them over several steps. unlike autoregressive models, which output one token at a time, this approach reduces memory bottlenecks and makes better use of gpu compute. the models come in 3b, 8b, and 14b sizes, plus an 8b vision-language model, all under open licenses. training code is available through the nvidia megatron bridge framework.

the models support three generation modes in a single checkpoint. autoregressive mode works like a standard left-to-right llm for compatibility. diffusion mode generates blocks of tokens iteratively, producing up to 2.6 times more tokens per forward pass than autoregressive decoding. self-speculation mode uses diffusion to draft candidate tokens and autoregressive decoding to verify them, reaching a speedup of up to 6.4 times with comparable accuracy. switching modes requires minimal application changes.
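
the draft-and-verify idea behind self-speculation mode can be sketched with toy stand-ins. this is an illustration under stated assumptions, not nvidia's implementation: `target_next`, `draft_block`, the vocabulary size, and the rule that makes the drafter wrong at every fourth position are all hypothetical. a real model would verify a whole drafted block in a single forward pass, which the sketch counts as one pass.

```python
VOCAB = 50  # hypothetical toy vocabulary size


def target_next(prefix):
    """Toy stand-in for one greedy autoregressive step of the model."""
    return (sum(prefix) * 31 + len(prefix) * 7 + 3) % VOCAB


def draft_block(prefix, k):
    """Toy stand-in for a diffusion draft: k tokens proposed in one pass.

    The draft is deliberately wrong at every 4th absolute position so that
    the verifier has something to reject.
    """
    seq = list(prefix)
    block = []
    for _ in range(k):
        tok = target_next(seq)
        if len(seq) % 4 == 3:
            tok = (tok + 1) % VOCAB  # inject a drafting error
        block.append(tok)
        seq.append(tok)
    return block


def autoregressive(prompt, n):
    """Baseline: one forward pass per generated token."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n:
        seq.append(target_next(seq))
        passes += 1
    return seq[len(prompt):], passes


def self_speculative(prompt, n, k=4):
    """Draft k tokens, keep the longest verified prefix, fix the first miss."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n:
        block = draft_block(seq, k)
        passes += 1  # one pass to draft the block
        passes += 1  # one pass to verify it (parallel in a real model)
        accepted = []
        for tok in block:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch
                break                      # discard the rest of the draft
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    ar_out, ar_passes = autoregressive(prompt, 16)
    sp_out, sp_passes = self_speculative(prompt, 16)
    print("same output:", ar_out == sp_out)
    print("forward passes:", ar_passes, "vs", sp_passes)
```

because every kept token is either confirmed or corrected by the autoregressive check, the output matches plain greedy decoding; the saving comes only from needing fewer passes when drafts are mostly accepted.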

nemotron-labs diffusion was trained by adding diffusion capabilities to an existing autoregressive model using a joint objective. it was pretrained on 1.3 trillion tokens and fine-tuned on 45 billion tokens from nvidia datasets. deployment will be supported in sglang, allowing easy mode selection via configuration. on b200 hardware, self-speculation reached about 865 tokens per second, roughly four times the autoregressive baseline.
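
as a rough worked example of what the b200 figure implies, the arithmetic below derives the autoregressive baseline from the two numbers quoted above. the baseline is only an estimate because the source says "roughly four times", and the 1,000-token response length is a hypothetical chosen for illustration.

```python
SPEC_TOKENS_PER_SEC = 865.0  # self-speculation throughput quoted above
APPROX_SPEEDUP = 4.0         # "roughly four times", so treat as approximate

# Implied autoregressive baseline, in tokens per second (an estimate).
baseline_tokens_per_sec = SPEC_TOKENS_PER_SEC / APPROX_SPEEDUP


def seconds_for(tokens, tokens_per_sec):
    """Time to generate `tokens` at a steady throughput."""
    return tokens / tokens_per_sec


if __name__ == "__main__":
    response_tokens = 1000  # hypothetical response length
    print(f"implied baseline: {baseline_tokens_per_sec:.0f} tokens/s")
    print(f"self-speculation: {seconds_for(response_tokens, SPEC_TOKENS_PER_SEC):.2f} s")
    print(f"autoregressive:   {seconds_for(response_tokens, baseline_tokens_per_sec):.2f} s")
```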

why it matters: faster text generation with flexible modes can reduce latency and cost for ai applications, especially when serving single queries or small batches.
