Microsoft Azure Launches NVIDIA GB300 Blackwell Ultra GPU Cluster for Large-Scale AI Model Training

Microsoft Azure launches its first production cluster with NVIDIA GB300 Blackwell Ultra GPUs, scaling AI with new VMs to cut training times for huge models.

Microsoft's announcement that it is rolling out its first at-scale production cluster featuring NVIDIA's GB300 "Blackwell Ultra" GPUs is good news for Azure: the new platform is built to handle huge AI models and to cut training times drastically.

Scaling AI with Blackwell Ultra Architecture

The initial cluster integrates more than 4,600 NVIDIA GB300 "Blackwell Ultra" GPUs interconnected with next-generation InfiniBand fabric. This launch represents the first phase of Microsoft's scaling plans toward hundreds of thousands of these GPUs to run advanced AI workloads across its global datacenters.

According to Microsoft, the new infrastructure will cut the training time for complex AI models from months to weeks. It will also enable training of models beyond the scale of 100 trillion parameters, a meaningful leap in the scale and complexity of AI models.

Technical Specifications of the New Azure VMs

The new Microsoft Azure ND GB300 v6 Virtual Machines are optimized for demanding AI tasks such as reasoning models, agentic AI, and multimodal generative AI. Each rack in the cluster contains 18 VMs with four GPUs each, for a total of 72 GPUs per rack. Key specifications for a single rack are as follows:

  • Processing Power: 72 NVIDIA Blackwell Ultra GPUs paired with 36 NVIDIA Grace CPUs.
  • Networking: 800 Gb/s per GPU of cross-rack bandwidth via NVIDIA Quantum-X800 InfiniBand.
  • Intra-Rack Bandwidth: 130 TB/s of NVIDIA NVLink bandwidth.
  • Memory: 37 TB of fast memory.
  • Performance: Up to 1,440 petaflops of FP4 Tensor Core performance.
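Dividing the published rack totals by the 72 GPUs per rack gives rough per-GPU figures. This is simple arithmetic on the numbers above, not an official per-GPU specification; in particular, the 37 TB "fast memory" figure pools GPU and Grace CPU memory, so the per-GPU value is only indicative.

```python
# Back-of-envelope per-GPU figures derived from the published rack totals.
# These are plain divisions, not official per-GPU specifications.

GPUS_PER_RACK = 72

rack = {
    "fp4_petaflops": 1440,   # FP4 Tensor Core performance per rack
    "nvlink_tb_s": 130,      # intra-rack NVLink bandwidth
    "fast_memory_tb": 37,    # pooled fast memory (GPU + Grace CPU)
}

per_gpu = {key: value / GPUS_PER_RACK for key, value in rack.items()}

print(f"FP4 per GPU:         {per_gpu['fp4_petaflops']:.0f} petaflops")
print(f"NVLink per GPU:      {per_gpu['nvlink_tb_s'] * 1000:.0f} GB/s")
print(f"Fast memory per GPU: {per_gpu['fast_memory_tb'] * 1000:.0f} GB")
```

The headline numbers are easier to compare with earlier GPU generations once reduced to per-GPU terms: roughly 20 FP4 petaflops and about 1.8 TB/s of NVLink bandwidth per GPU.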

Advanced Infrastructure for Maximum Efficiency

To interconnect these massively capable racks, Azure employs a non-blocking fat-tree architecture utilizing NVIDIA Quantum-X800 InfiniBand. This architecture minimizes communication overhead and maximizes GPU utilization, which allows researchers to iterate on AI training workloads faster and at lower cost.
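To get a feel for what 800 Gb/s of per-GPU cross-rack bandwidth means in practice, a quick sketch: the time to stream one GPU's full gradient copy between racks, assuming an illustrative 70-billion-parameter model in BF16 (these model figures are assumptions for the example, not numbers from the announcement).

```python
# Rough time to move one GPU's full gradient copy across racks at the
# quoted 800 Gb/s per-GPU InfiniBand bandwidth. The model size and
# precision are illustrative assumptions, not announcement figures.

LINK_GBPS = 800                      # cross-rack bandwidth per GPU (Gb/s)
PARAMS = 70e9                        # hypothetical 70B-parameter model
BYTES_PER_PARAM = 2                  # BF16: two bytes per parameter

grad_bytes = PARAMS * BYTES_PER_PARAM
link_bytes_per_s = LINK_GBPS / 8 * 1e9   # Gb/s -> bytes/s

seconds = grad_bytes / link_bytes_per_s
print(f"~{seconds:.2f} s to stream a full BF16 gradient copy per GPU")
```

Even this naive serial transfer finishes in seconds; real collectives overlap communication with compute, which is why the fat tree's non-blocking property (full bandwidth between any pair of racks at once) matters more than any single link's speed.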

The platform also uses NVIDIA SHARP technology, which speeds up collective operations by performing mathematical calculations on the network switches, effectively doubling the available bandwidth for these operations.
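The "doubling" claim can be sketched with a simple traffic model: a classic ring all-reduce moves roughly 2·(N−1)/N copies of the buffer per GPU, while a switch-side reduction in the style of SHARP moves about one copy up to the switch and the result back down. This is an illustrative model only, not NCCL's or SHARP's actual implementation.

```python
# Why in-network reduction roughly doubles effective bandwidth:
# a ring all-reduce sends ~2*(N-1)/N copies of the buffer per GPU,
# while switch-side reduction (SHARP-style) sends ~1 copy per GPU.
# Simplified traffic model, not the real NCCL/SHARP implementation.

def ring_allreduce_traffic(n_gpus: int) -> float:
    """Data sent per GPU by a ring all-reduce, in buffer-size multiples."""
    return 2 * (n_gpus - 1) / n_gpus

def in_network_traffic(n_gpus: int) -> float:
    """Switch reduces in-network: each GPU sends its buffer once."""
    return 1.0

for n in (8, 72, 4608):
    ratio = ring_allreduce_traffic(n) / in_network_traffic(n)
    print(f"{n:>5} GPUs: ring moves {ratio:.2f}x the data of in-network reduction")
```

As the GPU count grows, the ratio approaches exactly 2, which is the sense in which offloading the reduction to the switches "doubles" the bandwidth available to collective operations.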

Advanced cooling and power distribution systems maintain thermal stability while being designed to accommodate the high-power requirements of the new GPU clusters with minimum water usage.

Availability

The new Azure VMs equipped with NVIDIA GB300 "Blackwell Ultra" GPUs are now deployed and available to customers. This partnership, says NVIDIA, marks an inflection point for the United States in the global AI race.
