source: Google DeepMind: Introducing Gemini Omni
level: technical
Google DeepMind introduced Gemini Omni, a model that creates video from any combination of inputs. The first release, Gemini Omni Flash, is available in the Gemini app, Google Flow, and YouTube Shorts. It can generate high-quality videos grounded in real-world knowledge and allows editing through natural-language conversation. Users can combine images, audio, video, and text as input, and the model maintains consistency across edits, preserving characters, physics, and scene context.
The model reasons about physics and world knowledge to produce realistic scenes. It understands physical effects such as gravity and fluid dynamics, and draws on history, science, and culture for storytelling. Examples include alphabet videos built from unusual items, claymation explainers of protein folding, and scenes whose visuals sync to music beats. Users can reference existing images, videos, or audio to guide style, motion, or effects. They can also transform sketches into realistic footage or apply the poses from one video to a character taken from an image.
For responsible use, Google embeds a SynthID digital watermark in all generated videos and initially limits voice generation to digital avatars that look and sound like the user. Editing of audio and speech is still being tested. The model will roll out to developers and enterprise customers via APIs in the coming weeks. Gemini Omni Flash is available to Google AI Plus, Pro, and Ultra subscribers, and at no cost on YouTube Shorts and the YouTube Create app.
why it matters: Gemini Omni gives AI practitioners a single tool for generating and editing video from multimodal inputs, enabling rapid prototyping of visual content grounded in world knowledge.