Alibaba Cloud Aegaeon Reduces GPUs for AI Models by Over 80% with New Pooling System

Alibaba Cloud introduces Aegaeon, a new GPU pooling system that can reduce the number of Nvidia GPUs needed for AI model inference by more than 80%.

Alibaba Cloud has unveiled a new GPU pooling system called Aegaeon, claiming that it can cut the number of Nvidia GPUs needed to run large language models (LLMs) by 82 percent. The figure comes from a multi-month beta test in the company's own Model Studio marketplace and was later published in a peer-reviewed paper at the 2025 ACM Symposium on Operating Systems Principles (SOSP).

What Is Aegaeon, and How Does It Work?

Aegaeon is an inference-time scheduler: it does not accelerate model training, but it pushes serving throughput toward the hardware's maximum when already-trained models handle real-time requests with volatile or sporadic demand.

Instead of dedicating one GPU to a single AI model, Aegaeon creates a shared pool of accelerators. It virtualizes access at the token level, scheduling small slices of work from several different models across the whole pool at once. In effect, a single Nvidia H20 GPU can serve several LLMs simultaneously, greatly improving overall efficiency.
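
The article does not detail the scheduler's internals, so the sketch below is only a minimal illustration of the token-level idea; the names (TokenLevelScheduler, Request, step) are hypothetical, not Alibaba's API. Rather than letting one model monopolize a GPU, the scheduler interleaves single-token decode steps from requests aimed at different models on the same pooled accelerator:

```python
import collections
from dataclasses import dataclass, field

@dataclass
class Request:
    model: str                 # which hosted LLM this request targets
    tokens_left: int           # decode steps still to run
    output: list = field(default_factory=list)

class TokenLevelScheduler:
    """Round-robin over pending requests, one token per turn (illustrative only)."""
    def __init__(self):
        self.queue = collections.deque()

    def submit(self, req):
        self.queue.append(req)

    def step(self):
        # Take the next request regardless of which model it belongs to,
        # decode exactly one token, then requeue it if it isn't finished.
        req = self.queue.popleft()
        req.output.append(f"<{req.model}:token>")  # stand-in for a real decode step
        req.tokens_left -= 1
        if req.tokens_left > 0:
            self.queue.append(req)

sched = TokenLevelScheduler()
sched.submit(Request(model="qwen-72b", tokens_left=3))
sched.submit(Request(model="deepseek-7b", tokens_left=2))
while sched.queue:
    sched.step()
# Decode steps from both models interleave on the same "GPU" instead of
# one model holding the device while the other's requests sit idle.
```

Interleaving at this granularity only pays off if switching between models on a GPU is cheap, which is presumably where much of Aegaeon's engineering effort lies.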

Primary Findings from Production Tests

The paper, co-authored by researchers from Peking University and Alibaba Cloud, reported notable reductions in required hardware after several months of operation in a live production environment:

  • GPU Reduction: The number of GPUs needed to support dozens of LLMs (up to 72 billion parameters each) dropped from 1,192 to just 213 (the arithmetic is checked in the snippet after this list).
  • Efficiency Gains: In benchmark tests, Aegaeon raised "goodput," a measure of effective output, by 1.5 to 9 times compared with other systems such as ServerlessLLM and MuxServe.
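
As a quick sanity check on the headline number, the drop from 1,192 to 213 GPUs works out to roughly 82 percent:

```python
# Verify the reported reduction from the Model Studio beta test.
before, after = 1_192, 213
print(f"GPU reduction: {(before - after) / before:.1%}")  # GPU reduction: 82.1%
```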

The tests were reportedly conducted on Nvidia H20 GPUs, one of the main accelerators available to Chinese companies under U.S. export controls.

Wider Implications For The AI Industry

The Aegaeon findings suggest that cloud providers may be able to extract significant additional performance from their existing hardware, which is especially relevant in markets with limited access to the latest AI chips.

Whether those efficiencies can be reproduced outside Alibaba's own highly optimized environment remains unknown, however. The paper did not specify what type of network fabric was used, and Alibaba's unusual degree of control over its vertically integrated infrastructure may have contributed to Aegaeon's success.

Even so, the results will encourage other cloud operators searching for ways to handle the soaring demand for AI inference amid a limited supply of accelerators.

