Alibaba Cloud Unveils New System That Cuts GPU Needs for AI Models by More Than 80%
Alibaba Cloud has unveiled a new GPU pooling system called Aegaeon, claiming it can cut the number of Nvidia GPUs needed to run large language models (LLMs) by 82 percent. The company arrived at the figure after a multi-month beta test in its own Model Studio marketplace and later published the results in a peer-reviewed paper at the 2025 ACM Symposium on Operating Systems Principles (SOSP).
What Is Aegaeon, and How Does It Work?
Aegaeon is an inference-time scheduler: it does not speed up model training, but it pushes throughput toward the hardware's maximum when already-trained models serve real-time requests with volatile or sporadic demand.
Instead of dedicating one GPU to a single AI model, Aegaeon creates a shared pool of accelerators. It virtualizes access at the token level, scheduling very small units of work from several different models across the whole pool simultaneously. In effect, a single Nvidia H20 GPU can serve several LLMs at once, greatly increasing overall efficiency.
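To make the idea concrete, here is a minimal sketch of token-level pooling. The paper's actual scheduling and preemption machinery is not described in this article, so the round-robin policy, the class and model names, and the least-loaded placement rule below are all illustrative assumptions, not Aegaeon's real design.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    model: str        # which LLM this request belongs to
    tokens_left: int  # decode steps still to generate


class TokenLevelPool:
    """Hypothetical shared pool: each GPU queues requests from
    *different* models and advances them one token at a time."""

    def __init__(self, num_gpus: int) -> None:
        self.gpus = [deque() for _ in range(num_gpus)]

    def submit(self, req: Request) -> None:
        # Place the request on the least-loaded GPU, regardless of
        # model -- the pool is shared, not dedicated per model.
        min(self.gpus, key=len).append(req)

    def step(self) -> None:
        # One round: every GPU emits one token for the request at the
        # head of its queue, then rotates it to the back, so many
        # models make progress on the same accelerator.
        for queue in self.gpus:
            if not queue:
                continue
            req = queue.popleft()
            req.tokens_left -= 1       # "generate" one token
            if req.tokens_left > 0:
                queue.append(req)      # round-robin back in


pool = TokenLevelPool(num_gpus=2)
for model, tokens in [("model-a", 3), ("model-b", 2), ("model-c", 2)]:
    pool.submit(Request(model, tokens))
while any(pool.gpus):
    pool.step()
```

The key design point the sketch illustrates is granularity: because work is interleaved per token rather than per request or per model, a burst of demand for one model cannot monopolize a GPU, and idle capacity is immediately reusable by others.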
Primary Findings from Production Tests
The paper, co-authored by researchers from Peking University and Alibaba Cloud, reports notable reductions in required hardware observed over several months in a live production environment:
- GPU Reduction: The number of GPUs needed to support dozens of LLMs (up to 72 billion parameters in size) dropped from 1,192 to just 213, an 82 percent cut (see the quick check after this list).
- Efficiency Gains: Aegaeon raised "goodput," a measure of effective output, by 1.5x to 9x compared with prior systems such as ServerlessLLM and MuxServe in benchmark tests.
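The headline percentage follows directly from the reported GPU counts; a quick check:

```python
# Reduction implied by the paper's reported GPU counts (1,192 -> 213).
before, after = 1_192, 213
reduction = 1 - after / before
print(f"GPU reduction: {reduction:.1%}")  # -> 82.1%, matching the 82% claim
```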
The tests were reportedly conducted on Nvidia H20 GPUs, one of the main accelerators available to Chinese companies under U.S. export controls.
Wider Implications For The AI Industry
The Aegaeon findings suggest that cloud providers may be able to extract significantly more performance from their existing hardware, which is especially relevant in markets with limited access to the latest AI chips.
Whether those efficiencies can be reproduced outside Alibaba's own highly optimized environment remains unknown. The paper did not specify what type of network fabric was used, and Alibaba's unusually tight control over its vertically integrated infrastructure may have contributed to Aegaeon's success.
Even so, the results will encourage other cloud operators searching for ways to meet soaring demand for AI inference amid a constrained supply of accelerators.
