NVIDIA reveals DSX platform blueprint for building efficient AI factories using digital twin simulation and modular software libraries to minimize energy costs
NVIDIA has unveiled its DSX platform, a comprehensive blueprint to guide infrastructure developers in building and managing AI factories. According to the announcement, the DSX platform will combine open source software, a series of modular libraries, APIs and reference designs so that every level of computing, from silicon to building design, shares a common blueprint for running large-scale intelligence workloads at minimal cost.
To ease the difficulties developers face when installing and operating the physical hardware clusters that AI factories depend on, the platform links software to the physical infrastructure to bring down the energy cost per token. NVIDIA CEO Jensen Huang stated:
"We are not just shipping chips, we are giving every infrastructure builder a complete playbook to build AI factories."
Huang added that the platform lets developers simulate a facility design and validate its performance and reliability before installing any physical equipment, all while cutting electricity costs.
The new system architecture introduces DSX MaxLPS, a software package that aims to maximize token performance within a given power constraint. Combining 45 degrees Celsius liquid cooling with an optimized on-rack power management system allows as many as 40% more graphics processing units to be operated at their optimum efficiency points. The approach is intended to limit the power impact of heavy computational workloads, reducing the overall energy cost.
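The trade-off described above, fitting more GPUs into a fixed power budget by running each one at a more efficient operating point, can be illustrated with simple arithmetic. The following is a minimal sketch, not NVIDIA's software: the class, the function names and every number (power budget, watts per GPU, tokens per second) are hypothetical values chosen only so that the efficient case fits 40% more GPUs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """Hypothetical per-GPU operating point (illustrative numbers only)."""
    watts_per_gpu: float
    tokens_per_sec_per_gpu: float


def gpus_within_budget(budget_watts: float, point: OperatingPoint) -> int:
    """Number of whole GPUs that fit under the facility power budget."""
    return int(budget_watts // point.watts_per_gpu)


def facility_tokens_per_sec(budget_watts: float, point: OperatingPoint) -> float:
    """Aggregate token throughput when the budget is filled with GPUs."""
    return gpus_within_budget(budget_watts, point) * point.tokens_per_sec_per_gpu


def joules_per_token(point: OperatingPoint) -> float:
    """Energy per token: watts divided by tokens per second."""
    return point.watts_per_gpu / point.tokens_per_sec_per_gpu


# Assumed values: a 1 MW budget, a peak-clock point and an efficiency point.
BUDGET_WATTS = 1_000_000.0
peak = OperatingPoint(watts_per_gpu=1000.0, tokens_per_sec_per_gpu=100.0)
efficient = OperatingPoint(watts_per_gpu=714.0, tokens_per_sec_per_gpu=85.0)

for name, point in (("peak", peak), ("efficient", efficient)):
    print(
        name,
        gpus_within_budget(BUDGET_WATTS, point),
        facility_tokens_per_sec(BUDGET_WATTS, point),
        round(joules_per_token(point), 2),
    )
```

Under these assumed figures, each GPU at the efficient point is slower, but the facility as a whole produces more tokens per second from the same power budget and spends less energy on each token.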
Alongside this, NVIDIA unveiled DSX OS, a Linux-based open source operating system for supervising and operating the AI factory facility. The system manages basic system health, automated maintenance routines, resource allocation and security in a multi-tenant environment, creating a consistent runtime across massive physical hardware installations for cloud service providers.
NVIDIA DSX Sim allows developers to create high-fidelity digital twin simulations of computing facilities. System designers can use the simulation to optimize the physical layout and power distribution of proposed infrastructure before purchasing hardware components. The tool is integrated with the wider NVIDIA Omniverse platform and uses software from other companies, such as PTC, Cadence and Siemens, to automate design validation.
DSX Flex aims to address the tremendous electrical demands of AI factories by connecting the computing facility directly to local utility services and dynamically controlling power usage based on grid signals. This allows for optimization during peak hours and for management of on-site renewable energy storage systems. A pilot program with Silicon Valley Power and Emerald AI is in place to demonstrate grid-responsive power control without interrupting ongoing calculations.
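The idea of adjusting power usage in response to a grid signal while keeping jobs running can be sketched as follows. This is an illustrative assumption, not the DSX Flex interface: the function name, its parameters and the proportional scaling rule are all hypothetical. Each rack's power cap is scaled down toward a facility-wide target, but never below a floor that is assumed to be enough to keep running work alive.

```python
def respond_to_grid_signal(
    rack_caps_watts: list[float],
    min_rack_watts: float,
    target_total_watts: float,
) -> tuple[list[float], float]:
    """Scale rack power caps toward a grid-requested facility target.

    Hypothetical policy: scale every rack by the same factor, never raise a
    cap, and never drop a rack below min_rack_watts so work is throttled
    rather than interrupted. Returns the new caps and the total they add up to.
    """
    current_total = sum(rack_caps_watts)
    if current_total <= 0:
        return list(rack_caps_watts), current_total

    # Only curtail; a target above current draw leaves the caps unchanged.
    scale = min(1.0, target_total_watts / current_total)
    new_caps = [max(min_rack_watts, cap * scale) for cap in rack_caps_watts]
    return new_caps, sum(new_caps)


# Assumed example: three racks drawing 300 kW in total, and a grid signal
# asking the facility to drop to 225 kW during a peak period.
caps, achieved = respond_to_grid_signal(
    rack_caps_watts=[120_000.0, 120_000.0, 60_000.0],
    min_rack_watts=50_000.0,
    target_total_watts=225_000.0,
)
print(caps, achieved)
```

Because of the per-rack floor, the achieved total can end up slightly above the requested target. A real controller would need a policy for that case, such as redistributing the shortfall across racks that still have headroom.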
DSX Exchange manages the signals for cooling plants and power distribution networks that pass between operational technology (OT) and information technology (IT) systems. These software components are backed by a comprehensive DSX Reference Design, which gives designers pre-validated hardware configurations, including blueprints for the cooling systems and the construction of the facility, as well as server rack layouts for different hardware generations.
Major server manufacturers, including Supermicro, Lenovo, Dell Technologies and HPE, along with Taiwan-based manufacturers ASUS, Pegatron, Gigabyte, Quanta Cloud Technology, Foxconn, Wistron and Wiwynn, will begin building DSX-ready machines. These manufacturers will also contribute digital models to the simulation tool, allowing customers to test entire server racks prior to installation.
The software stack has been deployed in active data centers operated by cloud infrastructure providers including CoreWeave, Lambda, Nebius, Crusoe, Firmus, Yotta Data Services, IREN and Nscale, to reduce the operational costs associated with hardware management and to speed up installations. Other industry partners, such as Red Hat, Mirantis and Spectro Cloud, are adopting portions of the OS software to manage container orchestration and the security of distributed networks.
