AMD Unveils RDNA 4 Modular SoC Design and Architecture Details at Hot Chips 2025
AMD gave a more detailed look at its RDNA 4 GPU architecture at Hot Chips 2025. Although most of the architecture was revealed back in February, the new presentation focused on previously undisclosed details, especially the flexible modular SoC design underlying the Radeon RX 9000 series.
RDNA 4's Modular SoC Architecture
A key theme of the presentation was the modular nature of RDNA 4, which is designed to make it easy to produce a range of GPU configurations. This lets AMD scale the architecture up or down to differentiate high-end and low-end SKUs.
- The Scalable Design: The central design can be scaled up or down. For example, the Navi 44 GPU is configured with two Shader Engines and four GDDR6 memory controllers. To create a higher-end chip such as Navi 48 (used in the RX 9070 XT), AMD adds Shader Engines (SEs), L3 cache, Infinity Fabric interconnects, and memory controllers.
- Data Flow: Shader Engines contain Work Group Processors (WGPs), each made up of two Compute Units. The WGPs reach the memory controllers through the GL1 cache and the last-level (LL) cache, with Infinity Fabric interconnects linking the caches. The fabric delivers 1 KB/clock of bandwidth at frequencies of 1.5-2.5 GHz.
- Flexible Harvesting: The modularity lets AMD create multiple product SKUs by deliberately disabling sections of the die (a process known as harvesting). Shader Engines, WGPs, or even a memory device can be disabled, giving AMD maximum flexibility to meet market demand. Using this approach, there are currently four Navi 48 SKUs and three Navi 44 SKUs.
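The scaling and harvesting ideas above can be sketched in a few lines of Python. This is only an illustrative model, not AMD's tooling: the `GpuConfig` class, the `harvest` helper, the WGP count per Shader Engine, and the cut-down SKU are all hypothetical, and the bandwidth helper assumes "1KB/clock" means 1024 bytes per fabric clock. Only the two Shader Engines and four memory controllers for Navi 44 come from the presentation.

```python
from dataclasses import dataclass, replace

KIB = 1024  # assumption: "1KB/clock" is read as 1024 bytes per fabric clock


def fabric_bandwidth_tb_s(clock_ghz: float, bytes_per_clock: int = KIB) -> float:
    """Peak fabric bandwidth in TB/s (10**12 bytes per second)."""
    return bytes_per_clock * clock_ghz * 1e9 / 1e12


@dataclass(frozen=True)
class GpuConfig:
    """Hypothetical description of one die configuration."""
    name: str
    shader_engines: int
    wgps_per_engine: int      # assumed value, not stated in the presentation
    memory_controllers: int

    @property
    def compute_units(self) -> int:
        # Each WGP is made up of two Compute Units.
        return self.shader_engines * self.wgps_per_engine * 2


def harvest(cfg: GpuConfig, name: str, *, engines_off: int = 0,
            wgps_off_per_engine: int = 0, controllers_off: int = 0) -> GpuConfig:
    """Return a cut-down SKU with some blocks disabled."""
    engines = cfg.shader_engines - engines_off
    wgps = cfg.wgps_per_engine - wgps_off_per_engine
    controllers = cfg.memory_controllers - controllers_off
    if min(engines, wgps, controllers) < 1:
        raise ValueError("a SKU needs at least one of each block")
    return replace(cfg, name=name, shader_engines=engines,
                   wgps_per_engine=wgps, memory_controllers=controllers)


# Navi 44: two Shader Engines and four memory controllers (from the talk);
# eight WGPs per engine is an assumption made for this example.
navi44_full = GpuConfig("Navi 44 (full)", shader_engines=2,
                        wgps_per_engine=8, memory_controllers=4)
# Hypothetical harvested part: one WGP per engine and one controller disabled.
navi44_cut = harvest(navi44_full, "Navi 44 (hypothetical cut)",
                     wgps_off_per_engine=1, controllers_off=1)

if __name__ == "__main__":
    for ghz in (1.5, 2.5):
        print(f"{ghz} GHz -> {fabric_bandwidth_tb_s(ghz):.3f} TB/s")
    for sku in (navi44_full, navi44_cut):
        print(sku.name, sku.compute_units, "CUs,",
              sku.memory_controllers, "memory controllers")
```

Under the 1024-byte assumption, the quoted clock range works out to roughly 1.5 to 2.6 TB/s of peak fabric bandwidth.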
Memory and Bandwidth Optimization
Asked about memory bandwidth on RDNA 4, AMD said that its architectural tuning lowered the overall raw bandwidth requirement compared with earlier generations. This is supported by newer compression techniques.
- Central Compression/Decompression: RDNA 4 relies on new hardware-based compression algorithms that run entirely transparently to software. This saves roughly 25% of fabric bandwidth utilization, which also reduces power consumption. It is also said to improve performance by on the order of 15% in a few rasterization workloads.
- No to LPDDR Memory: AMD said that LPDDR memory, while it may look like an opportunity, is off the table for discrete graphics cards because of its bandwidth constraints and larger package size requirements.
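To put the roughly 25% fabric saving in perspective, here is a small arithmetic sketch. The function names and the 1000 GB/s input are hypothetical, and the model simply scales traffic by the quoted saving, which is a simplification of how real compression behaves across workloads.

```python
def fabric_traffic_after_compression(raw_gb_s: float, saving: float = 0.25) -> float:
    """Fabric traffic left once a fractional saving is applied."""
    if not 0.0 <= saving < 1.0:
        raise ValueError("saving must be in [0, 1)")
    return raw_gb_s * (1.0 - saving)


def payload_headroom(saving: float = 0.25) -> float:
    """How much more uncompressed payload the same fabric could carry."""
    if not 0.0 <= saving < 1.0:
        raise ValueError("saving must be in [0, 1)")
    return 1.0 / (1.0 - saving)


if __name__ == "__main__":
    raw = 1000.0  # hypothetical uncompressed traffic in GB/s
    print(fabric_traffic_after_compression(raw))  # 750.0
    print(round(payload_headroom(), 3))           # 1.333
```

Read this way, a 25% reduction in traffic is equivalent to the same fabric carrying about a third more uncompressed data, which is one way to see why raw bandwidth requirements can drop.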
Major Enhancements to Ray Tracing in RDNA 4
The presentation also summarised some architectural improvements that account for the ~2x increase in ray-tracing performance of RDNA 4 over RDNA 3.
- Oriented Bounding Boxes: Hardware support for rotating bounding boxes so that they align better with the geometry significantly reduces false-positive ray intersections and improves efficiency.
- Out-of-Order Memory: Independent memory requests can be processed out of order, which hides latency and keeps the execution units fed. This is especially important for divergent workloads such as ray tracing.
- Dynamic Register Allocation: RDNA 4 can allocate shader registers dynamically according to a task's current needs. This frees up register space, so many more "waves" of work can be in flight simultaneously than with RDNA 3's static, worst-case allocation.
- Wider BVH Structure: The Bounding Volume Hierarchy (BVH) structure was widened from 4-wide to 8-wide, doubling throughput to match the doubled intersection engines.
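Three of the points above lend themselves to back-of-the-envelope models, sketched here in Python. None of this reflects AMD's actual hardware logic: the function names are hypothetical, the register file size and per-wave register counts are made-up numbers, the BVH is assumed to be perfectly balanced, and the bounding-box comparison is done in 2D for simplicity.

```python
import math


def bvh_levels(leaf_count: int, width: int) -> int:
    """Levels a perfectly balanced BVH of the given width needs for leaf_count leaves."""
    if leaf_count < 1 or width < 2:
        raise ValueError("need leaf_count >= 1 and width >= 2")
    levels, capacity = 0, 1
    while capacity < leaf_count:
        capacity *= width
        levels += 1
    return levels


def waves_in_flight(register_file: int, regs_per_wave: int, max_waves: int) -> int:
    """Waves that fit in the register file, capped by a scheduler limit."""
    return min(max_waves, register_file // regs_per_wave)


def aabb_area_of_rotated_box(length: float, width: float, angle_deg: float) -> float:
    """Area of the axis-aligned box enclosing a rotated length x width rectangle (2D)."""
    a = math.radians(angle_deg)
    c, s = abs(math.cos(a)), abs(math.sin(a))
    return (length * c + width * s) * (length * s + width * c)


if __name__ == "__main__":
    # Wider BVH: fewer levels to traverse for the same number of leaves.
    print(bvh_levels(1_000_000, 4), bvh_levels(1_000_000, 8))  # 10 7

    # Register allocation with hypothetical numbers: reserving the worst case
    # (256 registers) versus the current need (96 registers).
    print(waves_in_flight(1024, 256, 16), waves_in_flight(1024, 96, 16))  # 4 10

    # A thin 10 x 1 box rotated by 45 degrees: the oriented box keeps an area
    # of 10, while the axis-aligned box grows to about 60.5.
    print(10 * 1, round(aabb_area_of_rotated_box(10, 1, 45), 1))
```

The extra area of the axis-aligned box is empty space that a ray can hit without touching any geometry, which is the source of the false-positive intersections that oriented boxes reduce.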
Other Important Improvements
AMD also provided insight into changes to the media and display engines, including B-frame support for AV1 encoding to minimize latency and direct integration of Radeon Image Sharpening 2 into the display block hardware.