diffusiongemma speeds up local text generation 4x

source: google deepmind: diffusiongemma: 4x faster text generation

level: technical

google deepmind introduced diffusiongemma, an experimental 26b-parameter mixture-of-experts model that generates text with a diffusion process instead of the usual token-by-token approach. it produces 256-token blocks in parallel, reaching over 1,000 tokens per second on an nvidia h100 and more than 700 on an rtx 5090. only 3.8b parameters are active during inference, and the quantized model fits within 18 gb of vram. it is released under an apache 2.0 license for research and development.
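a rough way to sanity-check the memory figure is to multiply the total parameter count by the bits stored per weight. the sketch below is a back-of-envelope estimate only: the bit widths are assumptions (the source does not name the quantization format), and it ignores activations, caches, and runtime overhead. it uses the total 26b rather than the active 3.8b because a mixture-of-experts model normally keeps every expert resident in memory.

```python
# Back-of-envelope weight-memory estimate for a quantized model.
# Assumption: the bit widths below are illustrative; the source does not
# state which quantization format yields the quoted 18 GB footprint.

def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

TOTAL_PARAMS_B = 26.0   # total parameters, from the article
VRAM_BUDGET_GB = 18.0   # footprint quoted in the article

for bits in (16, 8, 4):
    gb = weight_gb(TOTAL_PARAMS_B, bits)
    verdict = "fits" if gb <= VRAM_BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB -> {verdict} in {VRAM_BUDGET_GB:.0f} GB")
```

under these assumptions only the 4-bit case leaves headroom below 18 gb, which is consistent with the article's "when quantized" qualifier but does not confirm the exact format used.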

the model's bidirectional attention lets each token attend to all others in the block, which helps with tasks like inline editing, code infilling, and structured outputs. it refines its own output iteratively, correcting mistakes by evaluating the whole block at once. however, output quality is lower than standard autoregressive gemma 4 models, so it is not recommended for production where quality is critical. fine-tuning can improve performance on specific tasks, as shown by a sudoku-solving example.
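the refine-the-whole-block idea can be illustrated with a toy confidence-based unmasking loop, in the style of generic masked-diffusion decoders. this is a minimal sketch, not diffusiongemma's actual sampler: `toy_denoiser` is a hypothetical stand-in for the model, and the commit schedule is an assumption chosen for simplicity. the point it shows is that every position is re-predicted on each pass, so an early low-confidence guess can be dropped and replaced later.

```python
import random

MASK = None  # placeholder for a position that is not yet committed

def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for a model. It proposes a (token, confidence)
    pair for every position at once; more committed context makes a correct
    proposal more likely."""
    known = sum(tok is not MASK for tok in block) / len(block)
    proposals = []
    for want in target:
        correct = rng.random() < 0.5 + 0.5 * known
        token = want if correct else "?"
        confidence = rng.random() * (1.0 if correct else 0.5)
        proposals.append((token, confidence))
    return proposals

def refine_block(target, steps=8, seed=0):
    """Fill a block over several passes, committing more positions each pass."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    for step in range(1, steps + 1):
        proposals = toy_denoiser(block, target, rng)
        # Number of positions to commit grows linearly; all are committed at the end.
        keep = max(1, len(target) * step // steps)
        ranked = sorted(range(len(target)),
                        key=lambda i: proposals[i][1], reverse=True)
        # Everything is re-masked, then only the most confident proposals are kept,
        # so a position committed earlier can still be revised on a later pass.
        block = [MASK] * len(target)
        for i in ranked[:keep]:
            block[i] = proposals[i][0]
    return block

print("".join(refine_block(list("diffusion"))))
```

an autoregressive decoder, by contrast, cannot revisit a token once it has been emitted, which is why block-level refinement suits infilling and structured outputs.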

diffusiongemma is designed for local, low-concurrency use where autoregressive models underutilize hardware. by giving the gpu larger chunks of work per forward pass, it shifts the bottleneck from memory bandwidth to compute. in high-throughput cloud serving, where batching already keeps the gpu busy, the speed advantage diminishes. the model works with mlx, vllm, and hugging face transformers, with llama.cpp support coming soon. nvidia optimizations enable fast performance on consumer and enterprise gpus, but unified-memory architectures like apple silicon may not see the same gains.
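the bandwidth-to-compute shift can be sketched with a simple roofline-style estimate: each forward pass must read the active weights once, while the arithmetic grows with the number of tokens processed in that pass. the hardware numbers below are hypothetical round figures, not measurements of any specific gpu, and the model is simplified: it assumes 4-bit weights, about 2 flops per weight per token, and ignores expert routing across a block and the several refinement passes a diffusion model needs per block.

```python
def pass_times_s(tokens_per_pass, active_params, bytes_per_weight,
                 bandwidth_bytes_per_s, flops_per_s):
    """Return (memory_time, compute_time) in seconds for one forward pass."""
    # Weights are streamed from memory once per pass, regardless of token count.
    memory_time = active_params * bytes_per_weight / bandwidth_bytes_per_s
    # Roughly 2 FLOPs (multiply + add) per active weight per token.
    compute_time = 2 * active_params * tokens_per_pass / flops_per_s
    return memory_time, compute_time

ACTIVE_PARAMS = 3.8e9      # active parameters, from the article
BYTES_PER_WEIGHT = 0.5     # assumption: 4-bit quantized weights
BANDWIDTH = 1e12           # hypothetical: 1 TB/s memory bandwidth
COMPUTE = 100e12           # hypothetical: 100 TFLOP/s usable compute

for tokens in (1, 256):
    mem, comp = pass_times_s(tokens, ACTIVE_PARAMS, BYTES_PER_WEIGHT,
                             BANDWIDTH, COMPUTE)
    bound = "memory-bound" if mem > comp else "compute-bound"
    print(f"{tokens:>3} tokens/pass: memory {mem * 1e3:.2f} ms, "
          f"compute {comp * 1e3:.2f} ms -> {bound}")
```

with these placeholder figures, a single-token pass spends most of its time waiting on weight reads, while a 256-token pass is limited by arithmetic instead. a server that already batches many requests reaches the compute-bound regime with autoregressive decoding too, which fits the article's note that the advantage shrinks in the cloud.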

why it matters: it enables faster local inference for interactive ai applications, making real-time text generation more practical on consumer hardware.
