source: hugging face blog: fine-tuning nvidia cosmos predict 2.5 with lora/dora for robot video generation
level: technical
nvidia cosmos predict 2.5 is a large world model that generates physically plausible videos from text, images, or video clips. adapting it to a specific domain such as robot manipulation requires fine-tuning, but full fine-tuning of its 2 billion parameters is costly and risks overwriting the model's general knowledge. lora and dora instead inject small trainable adapters into the frozen model, which cuts memory use and keeps the adapter files small and portable. this guide shows how to fine-tune on a single gpu with the diffusers and accelerate libraries, then generate synthetic robot trajectories for downstream learning tasks.
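the adapter idea can be illustrated with a minimal pure-python sketch. it is not the guide's training code: the function names, the toy 2x2 layer, and the alpha/rank scaling are illustrative assumptions, and the per-column norm in the dora function follows the notation of the dora paper rather than any particular library.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w0, lora_a, lora_b, alpha, rank):
    """Merged LoRA weight: W0 + (alpha / rank) * B @ A, with W0 kept frozen."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
            for i in range(len(w0))]

def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]

def dora_weight(w0, lora_a, lora_b, alpha, rank, magnitude):
    """DoRA: LoRA updates the direction, a learned per-column magnitude rescales it."""
    v = lora_weight(w0, lora_a, lora_b, alpha, rank)
    norms = column_norms(v)
    return [[magnitude[j] * v[i][j] / norms[j] for j in range(len(v[0]))]
            for i in range(len(v))]

def lora_param_count(d_in, d_out, rank):
    """Trainable values in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Toy 2x2 layer with a rank-1 adapter (hypothetical numbers).
# B starts at zero, so the adapted layer initially equals the frozen one.
w0 = [[1.0, 2.0], [3.0, 4.0]]
lora_a = [[0.5, -0.5]]     # rank x d_in
lora_b = [[0.0], [0.0]]    # d_out x rank
merged = lora_weight(w0, lora_a, lora_b, alpha=1.0, rank=1)
dora = dora_weight(w0, lora_a, lora_b, 1.0, 1, column_norms(w0))
```

for a hypothetical 2048x2048 projection at rank 16, `lora_param_count(2048, 2048, 16)` gives 65,536 trainable values against 4,194,304 in the frozen matrix, which is where the memory saving comes from. the merged weight returned by `lora_weight` is also what fusing an adapter into the base model amounts to.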
training uses 92 robot manipulation videos, each paired with a text prompt describing a pick-and-place task. the model learns to predict subsequent video frames conditioned on the first two frames and the text prompt. lora adapters are added to the attention and feedforward layers of the diffusion transformer, and all other weights stay frozen. the loss follows a rectified flow formulation: the model is trained to predict the velocity along the straight path between noise and clean data. training for 100 epochs takes about 17 hours on one h100 gpu, or about 2.5 hours on eight gpus. checkpoints store only the adapter weights, as small safetensors files.
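the rectified flow objective can be sketched on plain lists. this is an illustration under stated assumptions, not the guide's training loop: it assumes the convention where t = 0 is clean data and t = 1 is pure noise (some implementations flip this), and the short `clean` list is a hypothetical stand-in for flattened video latents.

```python
import random

def rectified_flow_example(clean, noise, t):
    """Build the noisy input and the velocity target for one sample at time t in [0, 1]."""
    # Straight-line interpolation between clean data (t = 0) and noise (t = 1).
    x_t = [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]
    # The velocity along that line is constant: d x_t / d t = noise - clean.
    target_velocity = [n - c for c, n in zip(clean, noise)]
    return x_t, target_velocity

def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

rng = random.Random(0)
clean = [0.2, -0.1, 0.7, 0.4]                  # hypothetical latent values
noise = [rng.gauss(0.0, 1.0) for _ in clean]
t = rng.random()
x_t, target = rectified_flow_example(clean, noise, t)

# A perfect model outputs the target velocity and pays zero loss.
perfect_loss = mse(target, target)
# A model that always predicts zero velocity pays the mean squared target.
zero_loss = mse([0.0] * len(target), target)
```

in a real run the prediction would come from the diffusion transformer given `x_t`, `t`, the text prompt, and the conditioning frames, and only the adapter weights would receive gradients.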
for inference, the adapter is loaded and fused into the base model's weights, so generation runs without extra adapter overhead. the script generates videos from test prompts and initial frames. evaluation uses sampson error to measure geometric consistency between frames and across views, plus an llm judge that scores physical plausibility and instruction following. lora and dora make it practical to adapt the model to new domains with limited compute, enabling scalable synthetic data generation for robot learning.
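the sampson error used in evaluation has a standard closed form, sketched below for a single point match. how the guide obtains the fundamental matrix and the point matches is not stated in this summary, so the matrix `f_sideways` (a pure sideways camera shift) and the sample points are hypothetical values chosen for illustration.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a length-3 vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(m):
    """Transpose a 3x3 matrix."""
    return [[m[j][i] for j in range(3)] for i in range(3)]

def sampson_error(f, x1, x2):
    """First-order geometric error of a match (x1, x2) under fundamental matrix f.

    x1 and x2 are homogeneous pixel coordinates [u, v, 1] in the two images.
    """
    fx1 = mat_vec(f, x1)
    ftx2 = mat_vec(transpose(f), x2)
    # Epipolar constraint x2^T F x1, which is zero for a perfectly consistent match.
    residual = sum(a * b for a, b in zip(x2, fx1))
    denom = fx1[0] ** 2 + fx1[1] ** 2 + ftx2[0] ** 2 + ftx2[1] ** 2
    return residual ** 2 / denom

# Hypothetical geometry: the camera slides along the x axis, so matched points
# must stay on the same image row.
f_sideways = [[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]]
consistent = sampson_error(f_sideways, [1.0, 2.0, 1.0], [3.0, 2.0, 1.0])
inconsistent = sampson_error(f_sideways, [0.0, 0.0, 1.0], [0.0, 1.0, 1.0])
```

the first match keeps its row and scores 0.0, while the second drifts one pixel vertically and scores 0.5. averaging this value over many matches gives a single consistency score per frame pair or view pair, with lower meaning more geometrically consistent.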
why it matters: parameter-efficient fine-tuning of video world models enables cost-effective generation of synthetic robot training data, reducing the need for expensive real-world demonstrations.