frozen multi-token prediction speeds up gemini nano on pixel

source: google research: accelerating gemini nano models on pixel with frozen multi-token prediction

level: technical

google introduced a method to add multi-token prediction to already deployed gemini nano v3 models on pixel phones. the approach freezes the base model weights and attaches a lightweight transformer head to its final layers. this head predicts multiple future tokens at once, using the main model's hidden states. the base model then verifies these candidate tokens in parallel. incorrect predictions are discarded, so the final output matches the original model exactly.
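the draft-and-verify loop described above can be sketched in a few lines. this is a toy illustration under stated assumptions, not google's implementation: `toy_base_next` and `toy_draft` are hypothetical stand-ins for the frozen base model and the multi-token prediction head, decoding is greedy, and verification is written as a sequential loop even though the real system checks all candidates in one parallel pass. the names, the `k` parameter, and the toy token rule are all invented for the example.

```python
from typing import Callable, List, Tuple

Token = int
BaseNextFn = Callable[[List[Token]], Token]        # greedy next token from the base model
DraftFn = Callable[[List[Token], int], List[Token]]  # k candidate future tokens from the head


def greedy_generate(prompt: List[Token], base_next: BaseNextFn, max_new: int) -> List[Token]:
    """Reference decoding: one base-model step per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(base_next(out))
    return out


def speculative_generate(
    prompt: List[Token], base_next: BaseNextFn, draft: DraftFn, k: int, max_new: int
) -> Tuple[List[Token], int]:
    """Draft k tokens, verify them against the base model, keep the matching prefix.

    Every emitted token is the base model's own choice, so the result is
    identical to greedy_generate; only the number of passes changes.
    """
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        candidates = draft(out, k)
        passes += 1
        # A real system scores all candidates in one parallel forward pass;
        # this loop simulates that check one position at a time.
        for cand in candidates:
            expected = base_next(out)
            out.append(expected)      # always emit what the base model would emit
            if cand != expected:      # first wrong guess: drop the remaining candidates
                break
        else:
            out.append(base_next(out))  # all candidates accepted: one bonus token
    return out[: len(prompt) + max_new], passes


# Hypothetical toy models: the "base model" counts upward modulo 10, and the
# "draft head" agrees with it except that it always gets the token 5 wrong.
def toy_base_next(tokens: List[Token]) -> Token:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[Token], k: int) -> List[Token]:
    guesses, cur = [], tokens[-1]
    for _ in range(k):
        cur = (cur + 1) % 10
        guesses.append(0 if cur == 5 else cur)
    return guesses


reference = greedy_generate([0], toy_base_next, max_new=12)
spec_out, spec_passes = speculative_generate([0], toy_base_next, toy_draft, k=2, max_new=12)
```

in this toy run the speculative output equals the plain greedy output token for token, while using fewer passes than the 12 single-token steps the reference needs. that is the property the summary describes: wrong guesses cost speed, never correctness.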

the design avoids the memory and latency costs of running a separate drafter model. the multi-token prediction head cross-attends to the main model's key-value cache instead of keeping its own. this zero-copy architecture saves about 130 megabytes of memory per instance and removes the need to prefill the drafter with the prompt. the head is trained independently on a frozen backbone, preserving the model's capabilities and safety alignment.

on pixel 9 and 10 devices, the method speeds up token generation by 50 percent or more compared to standalone drafters of similar size. in tasks like notification summaries and proofreading, it predicts nearly two extra tokens per pass on average. fewer verification passes are then needed for the same output, which cuts energy use and improves battery life. the technique is already rolled out to users, making on-device ai features faster without retraining the base model.
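a rough back-of-the-envelope view of why extra accepted tokens reduce the number of verification passes: if each pass yields one base-model token plus some average number of accepted draft tokens, the pass count shrinks proportionally. the helper below is a hypothetical sketch; the 60-token output length and the value 2.0 are illustrative assumptions (the source says "nearly two"), not measured figures, and pass count is not the same thing as wall-clock speedup.

```python
import math


def verification_passes(num_tokens: int, extra_tokens_per_pass: float) -> int:
    """Estimate base-model passes needed to emit num_tokens.

    Assumes every pass yields 1 token from the base model plus
    extra_tokens_per_pass accepted draft tokens on average.
    """
    tokens_per_pass = 1.0 + extra_tokens_per_pass
    return math.ceil(num_tokens / tokens_per_pass)


# Illustrative numbers only: a 60-token summary, with and without drafting.
baseline_passes = verification_passes(60, 0.0)
drafted_passes = verification_passes(60, 2.0)
```

under these assumptions the drafted run needs about a third of the baseline passes. actual latency and energy gains depend on how much each pass costs on the device, which this estimate does not model.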

why it matters: it enables faster on-device language model inference with lower memory and energy use, directly improving mobile ai features like summarization and proofreading.
