The next generation of AI-driven robots, such as humanoids and autonomous vehicles, depends on high-fidelity, physics-aware training data. Without diverse, representative datasets, these systems are undertrained and risky to test: they generalize poorly, see too few real-world variations, and behave unpredictably in edge cases. Collecting massive real-world datasets for training is expensive, time-intensive, and often limited by what can be practically or safely captured.
NVIDIA Cosmos addresses this challenge by accelerating world foundation model (WFM) development. At the core of the platform, Cosmos WFMs speed up synthetic data generation and serve as a foundation for post-training downstream, domain- or task-specific physical AI models. This post explores the latest Cosmos WFMs, the key capabilities they bring to physical AI, and how to use them.
Cosmos Transfer for photorealistic videos grounded in physics
Cosmos Transfer WFM generates high-fidelity world scenes from structural inputs, ensuring precise spatial alignment and scene composition.
Employing the ControlNet architecture, Cosmos Transfer preserves pretrained knowledge, enabling structured, consistent outputs. It utilizes spatiotemporal control maps to dynamically align synthetic and real-world representations, enabling fine-grained control over scene composition, object placement, and motion dynamics.
Inputs:
- Structured visual or geometric data: segmentation maps, depth maps, edge maps, human motion keypoints, LiDAR scans, trajectories, HD maps, and 3D bounding boxes.
- Ground truth annotations: high-fidelity references for precise alignment.
Output: Photorealistic video sequences with controlled layout, object placement, and motion.


Figure 1. On the left, a virtual simulation or ‘ground truth’ created in NVIDIA Omniverse. On the right, photoreal transformation using Cosmos Transfer
Key capabilities:
- Generate scalable, photorealistic synthetic data that aligns with real-world physics.
- Control object interactions and scene composition through structured multimodal inputs.
Using Cosmos Transfer for controllable synthetic data
NVIDIA Omniverse, with its generative AI APIs and SDKs, accelerates physical AI simulation. Developers use Omniverse, built on OpenUSD, to create 3D scenes that accurately simulate real-world environments for training and testing robots and autonomous vehicles. These simulations serve as ground truth video inputs for Cosmos Transfer, combined with annotations and text instructions. Cosmos Transfer enhances photorealism while varying environment, lighting, and visual conditions to generate scalable, diverse world states.
This workflow accelerates the creation of high-quality training datasets, ensuring AI agents generalize effectively from simulation to real-world deployment.


Cosmos Transfer enhances robotics development by adding realistic lighting, colors, and textures to the Isaac GR00T Blueprint for synthetic manipulation motion generation, and by varying environmental and weather conditions in the Omniverse Blueprint for Autonomous Vehicle Simulation. This photorealistic data is crucial for post-training policy models, enabling smooth simulation-to-reality transfer and supporting model training for perception AI and specialized robot models such as GR00T N1.
Running inference with Cosmos Transfer
Here are some sample commands to use the Cosmos-Transfer1-7B model for inference.
Cosmos Transfer is openly available on Hugging Face under the NVIDIA Open Model License. To get started, generate a Hugging Face access token, log in with the CLI, accept the LlamaGuard-7b terms, and follow the Cosmos-Transfer1 GitHub instructions.
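For example, logging in from the command line looks like this (the access token is generated in your Hugging Face account settings and pasted when prompted):

# Install the Hugging Face CLI if it is not already available
pip install -U "huggingface_hub[cli]"

# Log in; paste your access token when prompted
huggingface-cli login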
The following command downloads the base model, tokenizer, and guardrail models for Cosmos-Transfer1:
PYTHONPATH=$(pwd) python scripts/download_checkpoints.py --output_dir checkpoints/
Use the following command to run the model. You can customize settings using a JSON file, enabling features like blur, canny, depth, or segmentation ControlNets individually or in combination.
export CUDA_VISIBLE_DEVICES=0
PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir checkpoints \
    --input_video_path path/to/input_video.mp4 \
    --video_save_name output_video \
    --sigma_max 70 \
    --controlnet_specs spec.json
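For reference, the controlnet spec is a small JSON file listing which control branches to enable and how strongly to weight them. The key names and fields below are illustrative assumptions only; the exact schema accepted by transfer.py is documented in the Cosmos-Transfer1 repository:

# Hypothetical spec.json enabling two ControlNet branches with equal weights.
# Key names ("depth", "seg") and fields are assumptions; check the repo docs.
cat > spec.json <<'EOF'
{
  "depth": { "control_weight": 0.5 },
  "seg":   { "control_weight": 0.5 }
}
EOF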
Cosmos WFMs can be post-trained into a vision-language-action (VLA) policy model, where video output is replaced by action output for robots to execute. For context, a policy model generates the actions a physical AI system should take based on its current observations and the given task. A well-trained WFM can model these dynamic patterns of the world and serve as a good initialization for the policy model.
Learn more about Cosmos Transfer examples on GitHub.
Cosmos Predict for generating future world states
Cosmos Predict WFM is designed to model future world states as video from multimodal inputs, including text, video, and start-end frame sequences. It is built using transformer-based architectures that enhance temporal consistency and frame interpolation.
Key capabilities:
- Generates realistic world states directly from text prompts (see the sketch after this list).
- Predicts next states from video sequences by filling in missing frames or extending motion.
- Generates multiple frames between a starting and ending image, creating a complete, smooth sequence.
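As a rough sketch, a text-to-world generation run could be invoked much like the Cosmos Transfer command shown earlier. The script path, flags, and prompt below are assumptions for illustration only; the authoritative commands are in the Cosmos-Predict1 GitHub instructions:

# Hypothetical invocation modeled on the Transfer command above; the actual
# script name and flags are documented in the Cosmos-Predict1 repository.
export CUDA_VISIBLE_DEVICES=0
PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
    --checkpoint_dir checkpoints \
    --prompt "A robot arm picks up a red cube from a cluttered table" \
    --video_save_name predicted_world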
Cosmos Predict WFM provides a strong foundation for training downstream world models in robotics and autonomous vehicles. You can post-train these models to generate actions instead of video for policy modeling, or adapt them for vision-language understanding to create custom perception AI models.
Cosmos Reason to perceive, reason, and respond intelligently
Cosmos Reason is a fully customizable multimodal AI reasoning model that is purpose-built to understand motion, object interactions, and space-time relationships. Using chain-of-thought (CoT) reasoning, the model interprets visual input, predicts outcomes based on the given prompt, and rewards the optimal decision. Unlike text-based LLMs, it grounds reasoning in real-world physics, generating clear, context-aware responses in natural language.
Input: Video observations and a text-based query or instruction.
Output: Text response generated through long-horizon CoT reasoning.
Key capabilities:
- Understands how objects move, interact, and change over time.
- Predicts and rewards the next best action based on input observation.
- Continuously refines decision-making.
- Purpose-built for post-training to build perception AI and embodied AI models.
Training pipeline
Cosmos Reason is trained in three stages, each enhancing its ability to reason, predict, and make decisions in real-world scenarios.
- Pretraining: Uses a Vision Transformer (ViT) to process video frames into structured embeddings, aligning them with text for a shared understanding of objects, actions, and spatial relationships.
- Supervised fine-tuning (SFT): Specializes the model in physical reasoning at two levels. General fine-tuning enhances language grounding and multimodal perception using diverse video-text datasets, while further training on physical AI data sharpens the model's ability to reason about real-world interactions. The model learns object affordances (how objects can be used in the real world), action sequences (how multi-step tasks unfold), and spatial feasibility (distinguishing realistic placements from impossible ones).

- Reinforcement learning (RL): The model evaluates different reasoning paths and updates itself only when a better decision emerges through trial and reward feedback. Instead of relying on human-labeled data, it uses rule-based rewards:
- Entity recognition: Rewarding accurate identification of objects and their properties.
- Spatial constraints: Penalizing physically impossible placements while reinforcing realistic object positioning.
- Temporal reasoning: Encouraging correct sequence prediction based on cause-effect relationships.
Get started
Cosmos WFMs are available on Hugging Face with inference scripts on GitHub for Cosmos-Predict1 and Cosmos-Transfer1.
Try Cosmos Predict preview NIM on build.nvidia.com.
Follow this workflow guide to use Cosmos Transfer for synthetic data generation.
Explore free NVIDIA GTC 2025 Cosmos sessions. Tune in to our upcoming livestream on Wednesday, March 26, at 11:00 AM PDT to hear more about the latest platform updates.