level: technical
NVIDIA Cosmos 3 is a unified world foundation model, now available on Hugging Face. It combines video generation, physical reasoning, and action prediction in one model built on a mixture-of-transformers architecture, where developers previously needed a separate model for each task. Cosmos 3 handles all modalities (text, image, video, audio, and action) in a single forward pass. It comes in two sizes: Cosmos 3 Nano, with 8 billion parameters, targets workstation GPUs, and Cosmos 3 Super, with 32 billion parameters, targets large-scale data generation on Hopper and Blackwell GPUs.
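To make the two sizes concrete, the sketch below estimates how much memory the weights alone would occupy. The parameter counts come from the paragraph above; the bytes-per-parameter table is a general rule of thumb, and the `headroom` factor is an illustrative assumption rather than an NVIDIA figure. Activations, caches, and framework overhead are not counted, so real requirements will be higher.

```python
# Rough weight-memory estimate for the two Cosmos 3 sizes.
# Parameter counts are taken from the text; everything else is an assumption.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

MODEL_PARAMS = {
    "nano": 8_000_000_000,    # Cosmos 3 Nano
    "super": 32_000_000_000,  # Cosmos 3 Super
}


def weight_memory_gb(variant: str, dtype: str = "bf16") -> float:
    """Return the memory needed for the weights alone, in decimal gigabytes."""
    return MODEL_PARAMS[variant] * BYTES_PER_PARAM[dtype] / 1e9


def fits(variant: str, gpu_memory_gb: float, dtype: str = "bf16",
         headroom: float = 0.8) -> bool:
    """Check whether the weights fit in a fraction (headroom) of GPU memory.

    The 0.8 default is a hypothetical safety margin, not a measured value.
    """
    return weight_memory_gb(variant, dtype) <= gpu_memory_gb * headroom


if __name__ == "__main__":
    for variant in MODEL_PARAMS:
        print(variant, weight_memory_gb(variant), "GB of weights in bf16")
```

By this estimate the bf16 weights come to about 16 GB for Nano and 64 GB for Super, which is consistent with the workstation versus data-center split described above.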
The model can generate realistic video from text, images, or actions. It can reason about motion, causality, and spatial relationships, and it can predict future video and action sequences. Example uses include robot pick-and-place tasks, long-tail driving scenarios, and warehouse safety simulations. For video generation, detailed narrative prompts work best. For action generation, concise prompts with spatial references are recommended. The model is integrated with the Hugging Face Diffusers library, so it can be run through a standard pipeline in a few lines of code.
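The two prompting styles can be illustrated with a small sketch. The helper names (`video_prompt`, `action_prompt`, `SpatialRef`) and the example wording are hypothetical and are not part of Cosmos 3 or Diffusers. The `load_pipeline` function uses the generic `DiffusionPipeline.from_pretrained` entry point from Diffusers; whether Cosmos 3 loads through that generic class, and which call arguments its pipeline accepts, are assumptions to check against the model card. The repository id is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class SpatialRef:
    """Hypothetical pairing of an object with a spatial phrase."""
    obj: str
    location: str


def video_prompt(subject: str, setting: str, motion: str, camera: str) -> str:
    """Build a detailed narrative prompt, the style suggested for video."""
    return f"{subject} in {setting}. {motion}. The camera {camera}."


def action_prompt(verb: str, target: SpatialRef, destination: SpatialRef) -> str:
    """Build a concise prompt with spatial references, the style suggested for actions."""
    return (
        f"{verb} the {target.obj} {target.location} "
        f"and place it {destination.location} the {destination.obj}."
    )


def load_pipeline(repo_id: str):
    """Load a pipeline via the generic Diffusers loader (assumed to apply here).

    repo_id must be the Hugging Face repository id from the model card;
    none is hard-coded because the text does not give one.
    """
    import torch
    from diffusers import DiffusionPipeline

    return DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)


if __name__ == "__main__":
    print(video_prompt(
        "A robot arm",
        "a brightly lit warehouse",
        "It slowly lifts a cardboard box from a conveyor belt",
        "stays fixed at table height",
    ))
    print(action_prompt(
        "Pick up",
        SpatialRef("red cube", "on the left side of the table"),
        SpatialRef("blue bin", "inside"),
    ))
```

Keeping prompt construction separate from model loading lets the prompt templates be reviewed and versioned without a GPU.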
NVIDIA also released synthetic data generation datasets for physical AI, covering robotics, physics simulation, spatial reasoning, human motion, driving, and warehouse environments. The Cosmos framework provides post-training scripts so users can fine-tune the model on their own robot or environment data. Agent skills are included to help with setup, prompt drafting, and running inference. The release aims to give the physical AI community a single foundation model for simulating and understanding the real world.
Why it matters: A single open model that handles perception, reasoning, and action can simplify building and testing physical AI systems such as robots and self-driving cars.