Language-Guided 3D Motion Forecasting with MolmoMotion

Source: Hugging Face blog: MolmoMotion: Language-Guided 3D Motion Forecasting

Level: technical

MolmoMotion is a new model that forecasts how objects will move in 3D space. Given a video frame, a set of 3D points on an object, and a text instruction such as "move and rotate the wooden bowl," it predicts where those points will go over the next few seconds. It uses Molmo 2 as its backbone to ground language in the objects and points of an image. The model comes in two versions: an autoregressive variant that predicts step by step and yields smooth paths, and a flow-matching variant that handles uncertainty better when several futures are plausible.
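The difference between the two decoding styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MolmoMotion's actual code: `step_fn` and `velocity_fn` are hypothetical stand-ins for learned networks, and the toy functions below replace them with closed-form rules so the example runs on its own. The autoregressive loop feeds each predicted step back in as input, while the flow-matching sampler starts from noise and integrates a velocity field from t=0 to t=1 with Euler steps.

```python
from typing import Callable, List, Tuple

Point3D = Tuple[float, float, float]
Frame = List[Point3D]  # positions of all tracked points at one time step


def autoregressive_rollout(
    start: Frame,
    step_fn: Callable[[List[Frame]], Frame],
    horizon: int,
) -> List[Frame]:
    """Predict one time step at a time, feeding each prediction back in."""
    history = [list(start)]
    for _ in range(horizon):
        history.append(step_fn(history))
    return history[1:]  # future frames only


def flow_matching_sample(
    noise: List[float],
    velocity_fn: Callable[[List[float], float], List[float]],
    num_steps: int = 8,
) -> List[float]:
    """Euler-integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (sample)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-ins for the learned networks (hypothetical, for illustration only).
def constant_shift_step(history: List[Frame]) -> Frame:
    """Move every point 1 cm along +x relative to the latest frame."""
    return [(x + 0.01, y, z) for (x, y, z) in history[-1]]


def make_toy_velocity(target: List[float]):
    """Velocity of the straight-line path from the current x to a fixed target."""
    def velocity(x: List[float], t: float) -> List[float]:
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return velocity


future = autoregressive_rollout([(0.0, 0.0, 0.0)], constant_shift_step, horizon=3)
sample = flow_matching_sample([5.0, -2.0], make_toy_velocity([1.0, 2.0]))
```

In a real flow-matching model, different noise draws would lead to different plausible trajectories, which is what lets that variant represent several possible futures. The toy velocity field here always converges to one fixed target, so it shows only the integration loop.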

The model was trained on MolmoMotion-1M, a dataset of 1.16 million videos with 3D point trajectories and action descriptions. The dataset was built automatically from internet videos by extracting object-grounded 3D tracks, filtering out noise, and keeping only clips in which objects actually move. To test the model, the team created PointMotionBench, a benchmark of 2,700 human-validated clips covering many objects and motion types. On this benchmark, MolmoMotion outperformed all other tested methods, including video generators and simple baselines.
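The "keep only clips where objects actually move" step might look something like the filter below. This is a minimal sketch, not the authors' pipeline: the function names, the displacement threshold, and the moving-fraction rule are all assumptions chosen for illustration, since the summary does not describe the actual criteria.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def track_displacement(track: Sequence[Point3D]) -> float:
    """Largest distance the point reaches from its first position."""
    start = track[0]
    return max(math.dist(start, p) for p in track)


def has_real_motion(
    tracks: List[Sequence[Point3D]],
    min_displacement: float = 0.05,    # assumed threshold, in scene units
    min_moving_fraction: float = 0.5,  # assumed share of points that must move
) -> bool:
    """Keep a clip only if enough of its tracked points move far enough."""
    if not tracks:
        return False
    moving = sum(1 for t in tracks if track_displacement(t) >= min_displacement)
    return moving / len(tracks) >= min_moving_fraction


static_clip = [[(0.0, 0.0, 0.0)] * 5, [(1.0, 1.0, 1.0)] * 5]
moving_clip = [[(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)], [(1.0, 1.0, 1.0), (1.0, 1.0, 1.0)]]
kept = [clip for clip in (static_clip, moving_clip) if has_real_motion(clip)]
```

Measuring displacement from the first position, rather than summing per-frame steps, makes the filter less sensitive to tracking jitter, which can accumulate into a large path length even for a static object.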

MolmoMotion's predictions are useful for downstream tasks. In robot planning, a policy using MolmoMotion succeeded on 76.3% of pick-and-place tasks in simulation, compared with 56.0% for a baseline. The policy also learned faster. For video generation, feeding MolmoMotion's predicted paths into a video model made the generated videos follow instructions more closely, especially for small, precise movements. The model weights, dataset, and benchmark are all openly released for the community to use and improve.
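To put the two success rates in perspective, the gap can be expressed both in percentage points and relative to the baseline. The short calculation below uses only the two figures reported above; the helper name `success_gain` is introduced here for illustration.

```python
def success_gain(model_rate: float, baseline_rate: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = model_rate - baseline_rate
    relative = 100.0 * absolute / baseline_rate
    return round(absolute, 2), round(relative, 2)


abs_gain, rel_gain = success_gain(76.3, 56.0)
print(f"{abs_gain} points absolute, {rel_gain}% relative")
```

This works out to a gain of about 20 percentage points, or roughly a 36% relative improvement over the baseline.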

Why it matters: Accurate 3D motion forecasting can improve robot manipulation and make generated videos more physically realistic, which directly benefits AI systems that interact with the physical world.
