mobile npu framework speeds up diffusion llm inference

source: arxiv machine learning: efficient on-device diffusion llm inference with mobile npu

level: technical

diffusion large language models generate multiple tokens in parallel, which helps reduce latency on mobile devices, but their repeated denoising steps are compute-heavy. mobile neural processing units (npus) handle dense matrix math quickly, yet three problems make them hard to use for these models. first, as tokens are finalized during generation, the number of tokens left to process in each block shrinks, leaving npu capacity idle. second, reusing key-value (kv) caches is difficult because tokens that were already generated can still be revised. third, the npu can address only a small region of memory, which forces extra data copying and remapping.

researchers built llada.cpp, the first framework that runs diffusion llms efficiently on smartphone npus. it aligns block-by-block inference with the dense matrix workloads that mobile npus handle best. one key method is multi-block speculative decoding: when the current block has few tokens left to process, the framework fills the idle npu capacity by speculatively decoding tokens for future blocks. this keeps the npu busy instead of leaving compute unused.
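
the scheduling idea behind multi-block speculative decoding can be sketched as follows. this is a minimal illustration under stated assumptions, not code from llada.cpp: the names `Block`, `plan_step`, `MASK`, and the fixed per-pass slot `budget` are all hypothetical, and the sketch covers only how slots are chosen, not the model call or how speculative results are verified.

```python
from dataclasses import dataclass

MASK = None  # hypothetical marker for a position that is not decoded yet


@dataclass
class Block:
    tokens: list  # token ids, or MASK where the position is still open

    def open_positions(self):
        # positions in this block that still need to be denoised
        return [i for i, t in enumerate(self.tokens) if t is MASK]


def plan_step(blocks, current, budget):
    """choose (block_index, position) slots for one npu pass.

    assumption: the npu processes up to `budget` token slots per pass.
    open slots from the current block are scheduled first; any leftover
    budget is spent on speculative slots from later blocks, in order.
    """
    plan = [(current, p) for p in blocks[current].open_positions()][:budget]
    for b in range(current + 1, len(blocks)):
        if len(plan) >= budget:
            break
        room = budget - len(plan)
        plan += [(b, p) for p in blocks[b].open_positions()[:room]]
    return plan


if __name__ == "__main__":
    blocks = [
        Block([11, 12, MASK, 14]),         # current block: one slot left
        Block([MASK, MASK, MASK, MASK]),   # future block: fully open
    ]
    print(plan_step(blocks, current=0, budget=4))
```

in this example the current block has a single open position, so three of the four slots in the pass go to the next block rather than sitting idle. a real system would also need a rule for accepting or discarding the speculative tokens once the earlier block is finalized, which this sketch leaves out.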

the framework also tackles the other two problems. it handles token revision in a way that allows better kv cache reuse, cutting down repeated work. for the limited npu address space, it reduces the need for costly remapping and data transfers. together, these changes let diffusion llms run faster and more efficiently on phones, making advanced language features more practical without cloud reliance.

why it matters: this work enables faster on-device ai assistants and text generation, reducing latency and cloud dependency for mobile users.
