Google TPU 8th Generation Architectures TPU 8t and TPU 8i Empowering Agentic Workflows and Autonomy

Google's Eighth-Generation TPU Shift: Architecting for Autonomy with the TPU 8t Training Workhorse and the TPU 8i Inference Engine over the Virgo and Boardfly Network Topologies

The era of AI hardware competing purely on peak floating-point throughput is reaching its limit. The computing environment of 2026 demands systems built for agentic workflows: complex reasoning loops that require multi-turn planning and sustained autonomy. Google's answer is to split its eighth-generation Tensor Processing Unit design into two parts, TPU 8t for training and TPU 8i for inference, resolving the conflicting operational-intensity demands of the two workloads.


The TPU 8t is built for the demands of large-scale pre-training. Its centerpiece is SparseCore, a dedicated accelerator for the irregular memory-access patterns of massive embedding lookups. By offloading these operations from the Matrix Multiply Unit, the chip's main compute engine, the design removes a bottleneck that hobbles general-purpose accelerators. Google also introduces native FP4 precision, halving the bits stored per parameter. That choice cuts the energy spent on data movement, because entire model layers can now fit in local on-chip buffers.
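Google has not published the TPU 8 FP4 encoding, so the sketch below is a hedged illustration only: it uses plain signed 4-bit integer quantization (the function names and scaling scheme are invented for this example) to show the storage halving the paragraph describes.

```python
# Illustrative only: the real FP4 format (e.g. exponent/mantissa split)
# is not specified in the article. This sketch shows symmetric 4-bit
# quantization and the resulting halved footprint versus an 8-bit format.

def quantize_4bit(weights, scale=None):
    """Map float weights to signed 4-bit integers in [-8, 7]."""
    if scale is None:
        scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from 4-bit codes."""
    return [v * scale for v in q]

weights = [0.9, -0.45, 0.12, -0.07]
q, scale = quantize_4bit(weights)
approx = dequantize(q, scale)

# Two 4-bit codes pack into one byte: half the footprint of 8-bit storage.
bytes_8bit = len(weights)            # 1 byte per parameter
bytes_4bit = (len(weights) + 1) // 2
print(bytes_4bit, bytes_8bit)  # 2 4
```

The halved footprint is what lets a layer's weights stay resident in on-chip buffers instead of round-tripping through HBM.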

The Virgo network is a two-layer, non-blocking topology carrying data across the cluster. Built from high-radix switches, the flattened fabric reduces training latency at large distributed scale. Integrated with JAX and Pathways, the system scales past 134,000 chips and delivers roughly 47 petabits per second of bisection bandwidth. Hundred-petabyte datasets stream from managed Lustre 10T storage directly into silicon memory via TPUDirect Storage, bypassing host-CPU bottlenecks to achieve roughly ten times the ingestion throughput of the previous Ironwood-based generation.
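A quick back-of-envelope check on the quoted fabric numbers: only the 47 Pbit/s and 134,000-chip figures come from the article; the per-chip share below is simple derived arithmetic, not an official specification.

```python
# Divide the quoted cluster-wide bisection bandwidth across the quoted
# chip count to get an average per-chip share of the bisection.

TOTAL_BISECTION_BITS_PER_S = 47e15   # 47 petabits per second (quoted)
CHIPS = 134_000                      # cluster scale (quoted)

per_chip_bits = TOTAL_BISECTION_BITS_PER_S / CHIPS  # bits/s per chip
per_chip_gbytes = per_chip_bits / 8 / 1e9           # convert to GB/s

print(f"~{per_chip_gbytes:.1f} GB/s of bisection bandwidth per chip")
```

At roughly 44 GB/s per chip across the bisection, all-reduce traffic during large-scale training is unlikely to be fabric-bound before it is HBM-bound.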

The TPU 8i is specialized for sampling and reasoning. Inference-heavy workloads, particularly chain-of-thought processing and Mixture-of-Experts architectures, require low-latency all-to-all communication. Google's answer is the Boardfly topology, which combines a hierarchical layout with high-radix switches to cut the maximum network diameter from 16 hops down to seven. That 56 percent reduction in hop count translates directly into lower tail latency for communication-intensive reasoning models.
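The hop-count arithmetic can be verified directly. The 16-hop and 7-hop diameters are quoted in the article; the per-hop latency below is an assumed, hypothetical figure used only to show how diameter scales worst-case delay.

```python
# Verify the quoted 56 percent diameter reduction and sketch its effect
# on worst-case network latency under an assumed per-hop delay.

OLD_DIAMETER = 16   # quoted pre-Boardfly diameter, in hops
NEW_DIAMETER = 7    # quoted Boardfly diameter, in hops
PER_HOP_NS = 250    # assumed switch traversal latency, NOT an official figure

reduction = (OLD_DIAMETER - NEW_DIAMETER) / OLD_DIAMETER
worst_case_old_ns = OLD_DIAMETER * PER_HOP_NS
worst_case_new_ns = NEW_DIAMETER * PER_HOP_NS

print(f"{reduction:.0%} fewer hops")            # 56% fewer hops
print(worst_case_old_ns, worst_case_new_ns)     # 4000 1750
```

Because tail latency in all-to-all collectives is set by the slowest path, shrinking the diameter compounds: every token of a reasoning chain pays the worst-case path, so the savings multiply across a long decode.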

The 8i chip further distinguishes itself with a new Collectives Acceleration Engine (CAE), which aggregates synchronization results across all cores with near-zero latency, speeding up the auto-regressive decoding loops at the heart of multi-agent workflows. On-chip SRAM capacity has tripled relative to the previous generation, letting the active Key-Value cache live in silicon. Cores spend less time idle, and models sustain throughput through extended decoding runs.
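A rough sizing of how much KV cache fits in the 384 MB of on-chip SRAM cited for TPU 8i: the SRAM figure is from the article, but every model dimension below is an assumption chosen for illustration, not a published TPU 8i or model spec.

```python
# Estimate how many tokens of KV cache fit in on-chip SRAM.
# All model dimensions are ASSUMED for illustration.

SRAM_BYTES = 384 * 1024 * 1024   # TPU 8i on-chip SRAM (quoted figure)

layers     = 32    # assumed decoder layers
kv_heads   = 8     # assumed grouped-query KV heads
head_dim   = 128   # assumed per-head dimension
elem_bytes = 1     # assumed FP8-style 1-byte cache entries

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * layers * kv_heads * head_dim * elem_bytes
tokens_in_sram = SRAM_BYTES // bytes_per_token

print(bytes_per_token, tokens_in_sram)  # 65536 6144
```

Under these assumptions a ~6K-token context fits entirely in SRAM, which is why keeping the KV cache on-chip can hold cores busy through a long decode without HBM round-trips.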

Both eighth-generation architectures pair with Google's custom Arm-based Axion host CPUs, which supply the compute needed for heavy data preprocessing. By controlling the full stack, from silicon to the coolant distribution units, the company claims roughly double the performance per watt. Fourth-generation liquid cooling handles power densities that would force standard air-cooled systems into throttling. By treating training and serving as fully separate disciplines with dedicated silicon paths, TPU 8t and 8i break from one-size-fits-all accelerators in favor of efficient agentic computing.

TPU 8t and TPU 8i at a Glance

| Feature | TPU 8t | TPU 8i |
| --- | --- | --- |
| Primary Workload | Large-scale pre-training | Sampling, serving, and reasoning |
| Network Topology | Virgo | Boardfly |
| Specialized Chip Features | SparseCore (embeddings) & LLM Decoder Engine | CAE (Collectives Acceleration Engine) |
| HBM Capacity | 216 GB | 288 GB |
| On-Chip SRAM (Vmem) | 128 MB | 384 MB |
| Peak FP4 PFLOPs | 12.6 | 10.1 |
| HBM Bandwidth | 6,528 GB/s | 8,601 GB/s (~1.3x TPU 8t) |
| Host CPU | Arm Axion | Arm Axion |

About the author

mgtid
Owner of Technetbook | 10+ Years of Expertise in Technology | Seasoned Writer, Designer, and Programmer | Specialist in In-Depth Tech Reviews and Industry Insights | Passionate about Driving Innovation and Educating the Tech Community
