Nvidia Cosmos 3 Mixture of Transformers Platform for Physical AI Foundation Models and Robotics Action Prediction within the Cosmos Coalition Open Weights Ecosystem
Nvidia introduced Cosmos 3 at GTC Taipei at Computex as an open world foundation model built specifically for physical AI. Using a Mixture of Transformers architecture, Cosmos 3 combines vision reasoning, physical world generation, and robotic action prediction in a single model. The model natively understands and processes text, images, video, sound, and physical actions. This is projected to cut training and evaluation time for autonomous systems from months to days.
The launch marks a crucial step toward an open-weights approach in robotics. To speed this transition, Nvidia also established the Cosmos Coalition, an international consortium of foundation model developers and hardware manufacturers that includes Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI. To standardize robotics platform integration and enable interoperability, all members will work under the Linux Foundation's OpenMDW 1.1 license, which permits all stakeholders to use and modify weights, documentation, and code.
The launch comes at the dawn of the physical AI revolution, made possible by rapid progress in language, vision, and world models with multimodal reasoning capabilities. The Cosmos 3 series offers a generational advance, allowing engineers to build autonomous systems, humanoids, self-driving vehicles, and other vision AI systems that can perceive, reason, plan, and act in the real physical environment.
Historically, training autonomous systems has been difficult because they must adapt to unknown environments while real-world data remains scarce. The Mixture of Transformers architecture in Cosmos 3 addresses this by assigning separate transformers to reasoning and to generation. This separation is crucial for accurately simulating physical behavior, predicting motion trajectories, and generating simulated video clips based on spatiotemporal relationships. The system has been trained on billions of physical data samples, reducing computation while easing the learning of robotics policies.
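To make the split between reasoning and generation concrete, here is a minimal toy sketch in Python. It is not Nvidia's implementation. It assumes that every token carries a role label, that all tokens share one self-attention step, and that each role then passes through its own feed-forward weights. The names mot_layer, shared_attention, scaled_identity, and the role labels are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_attention(tokens):
    """Self-attention over ALL tokens, whatever their role.

    Assumption: no learned projections, so the toy stays short.
    """
    d = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in tokens
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
        )
    return attended


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def scaled_identity(d, scale):
    """Stand-in for trained weights: a d x d identity matrix times `scale`."""
    return [[scale if i == j else 0.0 for j in range(d)] for i in range(d)]


def mot_layer(tokens, roles, experts):
    """One toy layer: shared attention, then a role-specific feed-forward step."""
    attended = shared_attention(tokens)
    outputs = []
    for token, context, role in zip(tokens, attended, roles):
        hidden = [t + c for t, c in zip(token, context)]  # residual connection
        # Route the token to the weights that belong to its role.
        ff = [max(0.0, v) for v in matvec(experts[role], hidden)]  # ReLU
        outputs.append([h + f for h, f in zip(hidden, ff)])
    return outputs


# Hypothetical two-dimensional tokens with illustrative role labels.
tokens = [[1.0, 0.5], [0.2, 0.9], [0.7, 0.3]]
roles = ["reasoning", "generation", "reasoning"]
experts = {
    "reasoning": scaled_identity(2, 1.0),
    "generation": scaled_identity(2, 2.0),
}
print(mot_layer(tokens, roles, experts))
```

In this toy, changing the generation weights leaves the outputs of reasoning tokens untouched, while the shared attention step still lets every token see every other token. That independence is the property the sketch is meant to show. How Cosmos 3 actually partitions its parameters is not specified in this article.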
Nvidia offers Cosmos 3 in three models that cater to different phases of hardware development. Cosmos 3 Super is specialized for post-training applications in humanoids and self-driving cars, where spatial accuracy is paramount. Cosmos 3 Nano, a highly optimized version, performs real-time spatial reasoning and produces high-quality video output within milliseconds. Cosmos 3 Edge, which will be released shortly, focuses on real-time inference on edge devices.
The new model has led a wide range of benchmarks. Among existing open-weights models, Cosmos 3 took first place in world generation accuracy on the Physics IQ, R Bench, and PAI Bench benchmarks. It also topped the RoboLab and RoboArena leaderboards for action policy execution, and performed strongly on VANTAGE Bench and the TAR challenge.
Industrial leaders have begun deploying Cosmos 3 in their production lines. Agile Robots is using it to create action-conditioned trajectories for its FR3 and Thor 3 humanoids, targeting complex industrial tasks that involve two-armed manipulation. Linker Vision is deploying it to analyze security camera feeds efficiently and to conduct root cause analysis on them. Notable partners that have begun using the technology include Samsung, LG Electronics, and Li Auto.

