
Google unveils eighth generation TPU: dual chip for the era of AI agents

Google unveiled the eighth generation of its own AI chips, TPUs, at the Cloud Next 2026 conference. For the first time in history, the specialized accelerator is split into two separate units — TPU 8t for model training and TPU 8i for inference. The decision reflects the growing demand for AI agent infrastructure, which requires a different approach to computational power and latency.

Why Google Separated Training from Inference

For ten years, Google developed universal chips capable of both training neural networks and running them in real time. With the advent of the AI agent era — autonomous systems that iteratively solve problems, plan, and learn from their actions — it became clear that a single architecture was no longer optimal.

“With the rise of AI agents, we have concluded that the community would benefit from chips individually specialized for the needs of training and serving,” said Amin Vahdat, senior vice president at Google and chief technologist for AI and infrastructure. Sundar Pichai, CEO of Alphabet, added that the TPU 8i architecture is designed “for massive throughput and low latency needed to simultaneously run millions of agents cost-effectively.”

This split is not unique to Google. Amazon Web Services previously introduced the separate Inferentia (2018) and Trainium (2020) chips, Microsoft announced Maia 200, the second generation of its own AI chip, in January 2026, and Meta is collaborating with Broadcom on multiple variants of AI processors. The trend toward specialized silicon is thus accelerating across all of Big Tech.

TPU 8t: Supercomputer for Training Frontier Models

The TPU 8t training unit is built for the most demanding foundation model development tasks. Google states that the new chip offers nearly 3x the computational performance per pod compared to the previous generation Ironwood at the same price. More specifically: one superpod now contains up to 9,600 chips and 2 petabytes of shared high-speed memory with double the bandwidth between chips compared to Ironwood.

The total performance of such a configuration reaches 121 ExaFlops — computational capacity that not even the largest national supercomputing centers had a few years ago. The new Virgo Network interconnect enables near-linear scaling up to a million chips in a single logical cluster.
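The pod-level figures quoted above can be broken down into rough per-chip numbers. The following is back-of-the-envelope arithmetic derived only from the article's pod totals, not official per-chip specifications:

```python
# Back-of-the-envelope sketch deriving per-chip figures from the pod-level
# numbers quoted above (9,600 chips, 2 PB shared memory, 121 ExaFlops).
# These are illustrative derivations, not official per-chip specs.

CHIPS_PER_SUPERPOD = 9_600
POD_MEMORY_BYTES = 2e15        # 2 petabytes of shared high-speed memory
POD_COMPUTE_FLOPS = 121e18     # 121 ExaFlops total pod performance

flops_per_chip = POD_COMPUTE_FLOPS / CHIPS_PER_SUPERPOD
memory_per_chip = POD_MEMORY_BYTES / CHIPS_PER_SUPERPOD

print(f"~{flops_per_chip / 1e15:.1f} PFLOPs per chip")
print(f"~{memory_per_chip / 1e9:.0f} GB of shared memory per chip")
```

The division yields roughly 12.6 PFLOPs and about 208 GB of the shared pool per chip, which gives a sense of the scale each individual accelerator contributes.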

A crucial metric is also “goodput” — the proportion of compute time spent on productive work. Google is targeting more than 97% with the TPU 8t. Thanks to automatic error detection, rerouting around faulty links, and optical circuit switching (OCS), downtime is minimized; in frontier training, even short outages can translate into the loss of days of work.

TPU 8i: Engine for Inference and the Agent Era

While the TPU 8t solves massive parallel computations, the TPU 8i is optimized for fast responses. Each chip contains 384 MB of on-chip SRAM — triple that of Ironwood — and 288 GB of high-speed HBM memory. This combination is meant to eliminate the so-called “memory wall,” where processors wait for data from slower external memory.

Google cites an 80% better performance-to-price ratio for the inference chip compared to the previous generation. In practice, this means companies can serve nearly twice as many users at the same cost. For modern Mixture of Experts (MoE) models like Gemini, two further features are key: doubled chip-to-chip bandwidth (19.2 Tb/s) and the new Boardfly topology, which reduces network diameter by more than 50%.
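Why the interconnect matters for MoE can be seen with simple bandwidth arithmetic. The sketch below uses the 19.2 Tb/s link figure from the article; the payload size and hop counts are made-up illustrations:

```python
# Rough sketch of why chip-to-chip bandwidth matters for Mixture of Experts:
# tokens routed to experts on other chips must send their activations across
# the interconnect. The payload size and hop counts are illustrative only.

LINK_BITS_PER_S = 19.2e12             # 19.2 Tb/s, as quoted above
LINK_BYTES_PER_S = LINK_BITS_PER_S / 8

payload_bytes = 1e9                   # hypothetical 1 GB of routed activations
transfer_s = payload_bytes / LINK_BYTES_PER_S
print(f"~{transfer_s * 1e6:.0f} microseconds per 1 GB link transfer")

# Halving the network diameter (the Boardfly claim) halves the worst-case
# number of hops a message must make, and with it the worst-case latency.
hops_before, hops_after = 6, 3        # illustrative hop counts
print(f"worst-case hops: {hops_before} -> {hops_after}")
```

At 19.2 Tb/s, moving a gigabyte across one link takes on the order of 400 microseconds, so both the raw bandwidth and the number of hops directly bound how quickly experts can exchange data.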

A special Collectives Acceleration Engine (CAE) built directly into the chip reduces the latency of global operations by up to 5x. For the average user, this means faster responses from AI assistants and smoother work with agents that cooperate in real time.

Competition with Nvidia and Market Context

Google deliberately avoids direct performance comparisons of the new TPUs with Nvidia chips, which in itself suggests it has not yet threatened the market leader's dominant position. Nvidia still controls an estimated 80–90% of the data center AI accelerator market. Google is nevertheless a significant player — DA Davidson analysts estimated in September 2025 that the TPU business together with Google DeepMind could be worth approximately $900 billion.

An interesting parallel development is the adoption of SRAM memory. With its March announcement of the Groq 3 LPU — technology gained through the $20 billion acquisition of the startup Groq — Nvidia is also betting on large amounts of SRAM. Google’s TPU 8i is going the same way, indicating that SRAM is becoming a key technology for low-latency inference.

Demand for Google chips is growing. Anthropic has committed to using several gigawatts of TPU capacity, Citadel Securities is building quantitative research software on them, and all 17 national laboratories of the U.S. Department of Energy use AI software built on TPUs. For Czech and European companies, TPUs are available through Google Cloud in several European regions including Western Europe.

Energy Efficiency and European Context

In the context of the European Union, where data centers are under increased pressure due to energy consumption and carbon footprint, efficiency plays a key role. Google states that both the TPU 8t and TPU 8i offer up to 2x better performance per watt than Ironwood. The company also declares that its data centers today deliver six times more computing power per unit of electricity than five years ago.

For Czech developers and companies, it is relevant that both chips support standard frameworks — JAX, PyTorch, MaxText, SGLang, and vLLM — and also offer so-called bare-metal access without virtualization overhead. Google promises general availability “later this year,” but has not yet published exact prices. The traditional model of renting computing capacity via Google Cloud is expected.

What It Means for the Future of AI Infrastructure

The eighth generation of TPU is not just a hardware iteration — it is a strategic shift. The specialization of chips for training and inference reflects the maturity of the AI industry, where it is no longer possible to effectively solve everything with a universal solution. For the European market, which under pressure from the EU AI Act and energy regulations is striving for sustainable AI development, the efficiency of TPU 8t/8i may be an interesting alternative to traditional GPU clusters.

Whether Google can significantly cut into Nvidia’s lead depends not only on raw performance, but also on the ecosystem, developer support, and availability in European cloud regions. While Nvidia dominates the market with universal GPUs, Google is betting on vertical integration: chips, network, software, and data centers designed as a single whole.

What is the difference between a TPU and a regular GPU from Nvidia?

TPU (Tensor Processing Unit) is a chip designed specifically for tensor operations, which form the core of neural networks. While a GPU from Nvidia is a more universal accelerator also suitable for graphics, gaming, and a wide range of computational tasks, TPUs are optimized primarily for machine learning. This means higher efficiency at lower energy consumption for AI tasks, but less flexibility for other applications.

Can an ordinary company buy a TPU 8t or 8i?

Not directly. Google does not sell TPUs as physical products, but rents their computing capacity through its Google Cloud platform. Companies and developers can order access to TPU clusters similarly to virtual servers. For Czech companies, they are available through European Google Cloud regions; the specific price list depends on the volume and type of tasks.

Why is SRAM memory important for inference?

SRAM (Static Random-Access Memory) is much faster than standard DRAM memory, but also more expensive and lower in capacity. During inference — that is, running a trained model — the goal is to keep the data needed for computation as close to the processor as possible. The more SRAM a chip contains, the less often it has to wait for slower external memory. This directly translates into faster responses from AI assistants and lower operating costs.
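The memory-wall effect described above comes down to bandwidth arithmetic: the time to stream a model's weights depends on which memory tier they sit in. A minimal sketch with rough, order-of-magnitude bandwidth assumptions (not TPU 8i specifications):

```python
# Minimal sketch of the "memory wall" arithmetic described above: the time
# to stream a layer's weights depends on where they live. The bandwidth
# figures are rough order-of-magnitude assumptions, not TPU 8i specs.

BANDWIDTH_BYTES_PER_S = {
    "on-chip SRAM": 100e12,  # assumed ~100 TB/s aggregate on-chip bandwidth
    "HBM":          3e12,    # assumed ~3 TB/s high-bandwidth memory
    "DDR DRAM":     0.1e12,  # assumed ~100 GB/s conventional DRAM
}

layer_weights_bytes = 200e6  # hypothetical 200 MB transformer layer

for tier, bw in BANDWIDTH_BYTES_PER_S.items():
    t_us = layer_weights_bytes / bw * 1e6
    print(f"{tier:>12}: ~{t_us:8.1f} microseconds to stream the layer")
```

Under these assumptions the same layer streams in a couple of microseconds from on-chip SRAM versus tens from HBM and thousands from conventional DRAM — which is why packing more SRAM onto the die shortens response times so directly.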