NVIDIA Blackwell Sets LLM Training Records in MLPerf v5.1 with NVFP4 Precision and Partner Support

NVIDIA's Blackwell Ultra GPU architecture sets new performance records in MLPerf Training v5.1, using NVFP4 precision to accelerate LLM training.

NVIDIA Blackwell and NVFP4 Dominate MLPerf Training v5.1

MLPerf Training v5.1 saw NVIDIA set records across the benchmark suite. The gains were driven by the new NVIDIA Blackwell Ultra GPU architecture and the rapid introduction of the NVFP4 precision format for training LLMs and other AI models.

Blackwell Ultra Architecture Makes Its Debut

This training round marked the first appearance of the GB300 NVL72 rack-scale system, built around the NVIDIA Blackwell Ultra GPU. Compared with a Hopper-based system using the same number of GPUs, the Blackwell Ultra system delivered:

  • More than 4x higher performance in Llama 3.1 405B pretraining.
  • Almost 5x higher performance in Llama 2 70B LoRA fine-tuning.

These gains are attributed to architectural advances, including new Tensor Cores that deliver 15 petaflops of NVFP4 AI compute and 279 GB of HBM3e memory per GPU. On the networking side, the NVIDIA Quantum-X800 InfiniBand platform also made its debut, doubling scale-out network bandwidth.
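To put those per-GPU figures in rack-scale terms, here is a quick back-of-the-envelope calculation in Python. It assumes, purely as an illustration, that the quoted per-GPU numbers apply to all 72 GPUs in a GB300 NVL72; the variable names are ours, not NVIDIA's.

```python
# Back-of-the-envelope aggregates for one GB300 NVL72 rack, using only
# the per-GPU figures quoted above (illustrative, not official specs).
GPUS_PER_RACK = 72           # a GB300 NVL72 links 72 Blackwell Ultra GPUs
NVFP4_PFLOPS_PER_GPU = 15    # peak NVFP4 AI compute per GPU, in petaflops
HBM3E_GB_PER_GPU = 279       # HBM3e capacity per GPU, in gigabytes

rack_exaflops = GPUS_PER_RACK * NVFP4_PFLOPS_PER_GPU / 1000
rack_memory_tb = GPUS_PER_RACK * HBM3E_GB_PER_GPU / 1000

print(f"~{rack_exaflops:.2f} exaflops of NVFP4 compute per rack")  # ~1.08
print(f"~{rack_memory_tb:.1f} TB of HBM3e per rack")               # ~20.1
```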

NVFP4 Precision Accelerates LLM Training

A key contributor to this round's results was the use of NVFP4 precision for training computations, a first in MLPerf Training. The NVIDIA Blackwell architecture performs FP4 calculations at twice the rate of FP8, and Blackwell Ultra raises that to three times. To date, NVIDIA is the only company to have submitted MLPerf Training results using FP4 precision while meeting the benchmark's strict accuracy requirements.
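For readers unfamiliar with the format: NVFP4 stores 4-bit floating-point (E2M1) elements that share a scale factor per small block of values. The NumPy sketch below mimics that idea; it is a simplified toy (the real format also uses FP8 block scales plus a per-tensor FP32 scale, and runs in Tensor Core hardware, not software), and the function name quantize_nvfp4_like is our own invention.

```python
import numpy as np

# Toy sketch of block-scaled FP4 quantization in the spirit of NVFP4:
# 4-bit E2M1 elements share one scale per 16-element block. For clarity,
# this version keeps the block scale in full precision.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1_GRID[::-1], E2M1_GRID])  # signed FP4 values
BLOCK = 16

def quantize_nvfp4_like(x: np.ndarray) -> np.ndarray:
    """Quantize a 1-D array (length divisible by 16) to FP4-representable values."""
    blocks = x.reshape(-1, BLOCK)
    # Scale each block so its max magnitude maps to E2M1's max value (6.0).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 6.0
    scales[scales == 0] = 1.0
    scaled = blocks / scales
    # Round each element to the nearest representable E2M1 value.
    idx = np.abs(scaled[..., None] - GRID).argmin(axis=-1)
    return (GRID[idx] * scales).reshape(x.shape)

x = np.random.randn(64).astype(np.float32)
xq = quantize_nvfp4_like(x)
print("max abs quantization error:", np.abs(x - xq).max())
```

The point of the block structure is that outliers only distort the 16 values in their own block, which is what lets 4-bit training preserve enough accuracy to meet MLPerf's quality targets.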

New Performance and Scaling Records

NVIDIA set a new time-to-train record for the Llama 3.1 405B model, training it in about 10 minutes with more than 5,000 Blackwell GPUs. This was 2.7x faster than the best Blackwell-based submission from the previous round, a result achieved by more than doubling the GPU count and by adopting NVFP4 precision.

Per-GPU performance improved as well: a submission with 2,560 Blackwell GPUs was 45% faster than a similarly sized submission in the previous round.
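A quick sanity check shows how these numbers fit together. Treat this as illustrative arithmetic only: the exact GPU count of the prior-round submission is not stated, and "over 5,000" is approximated below.

```python
# Decompose the quoted 2.7x speedup into scale-out vs. per-GPU factors
# (illustrative; GPU counts are approximations of the figures above).
speedup_total = 2.7    # vs. best Blackwell submission last round
gpus_now = 5000        # "more than 5,000" Blackwell GPUs this round
gpus_prev = 2560       # assumed prior-round scale (more than doubled since)

scale_out_factor = gpus_now / gpus_prev             # ~1.95x from more GPUs
residual_factor = speedup_total / scale_out_factor  # ~1.38x left over

print(f"scale-out contribution: ~{scale_out_factor:.2f}x")
print(f"per-GPU/software contribution: ~{residual_factor:.2f}x")
# Consistent with the ~1.45x matched-scale gain quoted above, given that
# scaling to roughly twice the GPUs is rarely perfectly linear.
```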

Results on New Benchmarks

NVIDIA also achieved performance records on two new benchmarks introduced in this round:

  • Llama 3.1 8B: Replacing the previous BERT-large workload, this smaller LLM was trained by NVIDIA in a record 5.2 minutes using 512 Blackwell Ultra GPUs.
  • FLUX.1: This image-generation benchmark replaced Stable Diffusion v2; NVIDIA was the only platform to submit results, setting a record of 12.5 minutes with 1,152 Blackwell GPUs.

NVIDIA also continues to hold records on previously established benchmarks covering graph neural networks, object detection, and recommender systems.

Broad Partner Ecosystem Participation

Fifteen NVIDIA ecosystem partners, including Dell Technologies, Hewlett Packard Enterprise, Lenovo, and Supermicro, also submitted results this round, reflecting broad adoption of the NVIDIA platform.

About the author

mgtid
Owner of Technetbook | 10+ Years of Expertise in Technology | Seasoned Writer, Designer, and Programmer | Specialist in In-Depth Tech Reviews and Industry Insights | Passionate about Driving Innovation and Educating the Tech Community
