source: Google DeepMind: Introducing Gemma 4 12B: A Unified, Encoder-Free Multimodal Model
level: technical
Google DeepMind introduced Gemma 4 12B, a new model that handles text, images, and audio without separate encoders. It sits between the smaller E4B and the larger 26B Mixture of Experts model. The model uses a unified architecture in which vision and audio inputs feed directly into the language model backbone. For vision, a lightweight embedding module replaces the usual encoder. For audio, the raw signal is projected into the same embedding space as text tokens. This design reduces latency and memory use.
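To make the encoder-free idea concrete, here is a minimal sketch of projecting image patches and audio frames into the same embedding space as text tokens. It is not Gemma's implementation. The patch-based split, the frame length, all dimensions, and the names `patchify`, `project`, `w_vision`, and `w_audio` are assumptions chosen for illustration; the source says only that a lightweight embedding module and a projection replace the separate encoders.

```python
import random

# All sizes below are hypothetical and kept tiny for readability.
PATCH = 2     # assumed patch side length in pixels
FRAME = 4     # assumed number of audio samples per frame
D_MODEL = 4   # assumed width of the shared embedding space

def patchify(image, patch=PATCH):
    """Split a 2-D grayscale image (list of rows) into flattened patch vectors.
    Assumes both image dimensions are multiples of `patch`."""
    rows, cols = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(patch) for j in range(patch)]
        for r in range(0, rows, patch)
        for c in range(0, cols, patch)
    ]

def project(vectors, weight):
    """Linear projection: multiply each length-n vector by an n x d weight matrix."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(vec, col)) for col in columns] for vec in vectors]

def random_matrix(rng, n_rows, n_cols):
    """Stand-in for learned weights."""
    return [[rng.gauss(0, 0.02) for _ in range(n_cols)] for _ in range(n_rows)]

rng = random.Random(0)
image = [[rng.random() for _ in range(4)] for _ in range(4)]   # toy 4x4 image
audio = [rng.uniform(-1, 1) for _ in range(8)]                 # toy 8-sample signal
text_embeddings = random_matrix(rng, 3, D_MODEL)               # 3 text tokens

w_vision = random_matrix(rng, PATCH * PATCH, D_MODEL)
w_audio = random_matrix(rng, FRAME, D_MODEL)

image_embeddings = project(patchify(image), w_vision)          # 4 patches
audio_frames = [audio[i:i + FRAME] for i in range(0, len(audio), FRAME)]
audio_embeddings = project(audio_frames, w_audio)              # 2 frames

# One sequence, one width: the backbone sees every modality the same way.
sequence = image_embeddings + audio_embeddings + text_embeddings
print(len(sequence), len(sequence[0]))  # prints: 9 4
```

The point of the sketch is that each modality needs only a cheap projection before joining the token sequence, so no separate encoder network has to be run or kept in memory.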
The model runs on laptops with 16 GB of VRAM or unified memory. It includes multi-token prediction drafters to reduce response time. Its benchmark results approach those of the 26B model while using less than half the memory. The release targets developers building agentic workflows that need local multimodal reasoning. The model is available under an Apache 2.0 license. Weights are on Hugging Face and Kaggle. It works with tools such as Ollama, LM Studio, Hugging Face Transformers, and vLLM.
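A rough calculation helps put the 16 GB figure in context. The sketch below estimates weight storage only, assuming that "12B" means roughly 12 billion parameters and that weights are stored at a uniform precision. Neither the exact parameter count nor the shipped precision is stated in the source, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
# Back-of-the-envelope weight memory for a dense model.
# Assumption: about 12e9 parameters, inferred from the "12B" name only.

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes), excluding KV cache and activations."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(12e9, precision)
    verdict = "under" if gb < 16 else "over"
    print(f"{precision}: {gb:.1f} GB of weights ({verdict} 16 GB before overhead)")
```

Under these assumptions, 16-bit weights alone would come to about 24 GB, while 8-bit and 4-bit weights would come to about 12 GB and 6 GB. That suggests a quantized variant is what fits on a 16 GB machine, although the source does not say which precision is used.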
Google also released a skills repository to help agents build with Gemma models. The model supports fine-tuning with Unsloth and deployment on Google Cloud. Gemma 4 models have passed 150 million downloads. Developers have used them for robotic arms, enterprise security, and more. The new 12B model aims to make advanced multimodal AI accessible on everyday hardware without cloud dependency.
why it matters: Running multimodal models locally on consumer hardware reduces latency, cost, and privacy risks for AI applications.