Alibaba Cloud Unveils New System That Cuts GPU Needs for AI Models by More Than 80%
Alibaba Cloud has unveiled a new GPU pooling system called Aegaeon, claiming it can cut the number of Nvidia GPUs needed to run large language models (LLMs) by 82 percent. The company arrived at the figure after a multi-month beta test in its own Model Studio marketplace and later published the results in a peer-reviewed paper at the 2025 ACM Symposium on Operating Systems Principles (SOSP).
What Is Aegaeon, and How Does It Work?
Aegaeon is an inference-time scheduler: it does not speed up model training, but it pushes throughput toward the hardware's maximum when already-trained models serve real-time requests with volatile or sporadic demand.
Instead of dedicating one GPU to a single AI model, Aegaeon creates a shared pool of accelerators. It virtualizes access at the token level, scheduling very small units of work from several different models across the whole pool simultaneously. In effect, a single Nvidia H20 GPU can serve several LLMs at once, greatly increasing overall efficiency.
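To make the idea concrete, here is a minimal sketch of token-level pooling. The paper's actual scheduling and preemption machinery is not described in this article, so the round-robin policy, the class and model names, and the least-loaded placement rule below are all illustrative assumptions, not Aegaeon's real design.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    model: str        # which LLM this request belongs to
    tokens_left: int  # decode steps still to generate


class TokenLevelPool:
    """Hypothetical shared pool: each GPU queues requests from
    *different* models and advances them one token at a time."""

    def __init__(self, num_gpus: int) -> None:
        self.gpus = [deque() for _ in range(num_gpus)]

    def submit(self, req: Request) -> None:
        # Place the request on the least-loaded GPU, regardless of
        # model -- the pool is shared, not dedicated per model.
        min(self.gpus, key=len).append(req)

    def step(self) -> None:
        # One round: every GPU emits one token for the request at the
        # head of its queue, then rotates it to the back, so many
        # models make progress on the same accelerator.
        for queue in self.gpus:
            if not queue:
                continue
            req = queue.popleft()
            req.tokens_left -= 1       # "generate" one token
            if req.tokens_left > 0:
                queue.append(req)      # round-robin back in


pool = TokenLevelPool(num_gpus=2)
for model, tokens in [("model-a", 3), ("model-b", 2), ("model-c", 2)]:
    pool.submit(Request(model, tokens))
while any(pool.gpus):
    pool.step()
```

The key design point the sketch illustrates is granularity: because work is interleaved per token rather than per request or per model, a burst of demand for one model cannot monopolize a GPU, and idle capacity is immediately reusable by others.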
Primary Findings from Production Tests
The paper, co-authored by researchers from Peking University and Alibaba Cloud, reports notable reductions in required hardware observed over several months in a live production environment:
- GPU Reduction: The number of GPUs needed to support dozens of LLMs (up to 72 billion parameters in size) dropped from 1,192 to just 213, an 82 percent cut (see the quick check after this list).
- Efficiency Gains: Aegaeon raised "goodput," a measure of effective output, by 1.5x to 9x compared with prior systems such as ServerlessLLM and MuxServe in benchmark tests.
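The headline percentage follows directly from the reported GPU counts; a quick check:

```python
# Reduction implied by the paper's reported GPU counts (1,192 -> 213).
before, after = 1_192, 213
reduction = 1 - after / before
print(f"GPU reduction: {reduction:.1%}")  # -> 82.1%, matching the 82% claim
```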
The tests were reportedly conducted on Nvidia H20 GPUs, one of the main accelerators available to Chinese companies under U.S. export controls.
Wider Implications For The AI Industry
The Aegaeon findings suggest that cloud providers may be able to extract significantly more performance from their existing hardware, which is especially relevant in markets with limited access to the latest AI chips.
Whether those efficiencies can be reproduced outside Alibaba's own highly optimized environment remains unknown. The paper did not specify what type of network fabric was used, and Alibaba's unusually tight control over its vertically integrated infrastructure may have contributed to Aegaeon's success.
Even so, the results will encourage other cloud operators searching for ways to meet soaring demand for AI inference amid a constrained supply of accelerators.
