Chapter 4
TPUs: Dataflow Machines
The Promise of This Chapter
After this chapter, readers will understand:
- Why TPUs look rigid compared to GPUs
- What dataflow actually means
- Why systolic arrays exist
- When TPUs are unbeatable — and when they fail
This chapter completes the GPU ↔ TPU contrast.
4.1 The Core TPU Assumption
TPUs are built around a different assumption:
Memory latency must be avoided, not hidden.
Instead of reacting to memory stalls, TPUs are designed so stalls rarely occur.
This single assumption changes everything.
4.2 What “Dataflow” Means (Plain English)
In a CPU or GPU:
- Instructions pull data when they need it

In a TPU:
- Data is scheduled to flow through compute units
- Compute waits for data by design
- Movement is explicit, predictable, and repeated

Think less "threads."
Think more "factory assembly line."
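To make the contrast concrete, here is a toy assembly line, a minimal sketch in plain Python rather than anything TPU-specific: data is pushed through a fixed chain of stages, and each stage computes only when its input arrives from upstream.

```python
# A toy "assembly line" in plain Python (an illustration, not TPU code).
# No stage reaches out to fetch data; each one computes exactly when its
# input arrives from the stage upstream.

def source():
    yield from range(5)                  # raw parts enter the line

def scale(upstream):
    for x in upstream:                   # fires only when data flows in
        yield x * 10

def offset(upstream):
    for x in upstream:
        yield x + 1

line = offset(scale(source()))           # wire the stages together
print(list(line))                        # [1, 11, 21, 31, 41]
```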
4.3 The Systolic Array Mental Model
A systolic array is a grid of simple MAC (multiply-accumulate) units:
- Data pulses rhythmically through the grid
- Each value is reused many times as it flows

Key idea:
Move data once. Compute on it many times.

This is the physical embodiment of "compute is cheap, data is expensive."
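The rhythm is easiest to see in a toy simulation. The sketch below (plain Python, illustrative only; real arrays are fixed-function hardware, and feeding and skew details vary by design) models an output-stationary systolic array computing C = A × B: rows of A stream in from the left, columns of B from the top, and each PE performs one multiply-accumulate per cycle on whatever flows past it.

```python
# Toy output-stationary systolic array computing C = A @ B (N x N).
# A streams in from the left edge, B from the top edge, each skewed by
# one cycle per row/column so matching operands meet at the right PE.

N = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

acc = [[0] * N for _ in range(N)]    # one accumulator per PE (stationary C)
a_reg = [[0] * N for _ in range(N)]  # A value each PE holds this cycle
b_reg = [[0] * N for _ in range(N)]  # B value each PE holds this cycle

for t in range(3 * N - 2):           # enough cycles to drain the pipeline
    # Shift A one PE to the right (reverse order = simultaneous shift).
    for i in range(N):
        for j in reversed(range(N)):
            a_reg[i][j] = a_reg[i][j - 1] if j > 0 else (
                A[i][t - i] if 0 <= t - i < N else 0)
    # Shift B one PE down.
    for j in range(N):
        for i in reversed(range(N)):
            b_reg[i][j] = b_reg[i - 1][j] if i > 0 else (
                B[t - j][j] if 0 <= t - j < N else 0)
    # Every PE does one MAC per cycle on the values flowing through it.
    for i in range(N):
        for j in range(N):
            acc[i][j] += a_reg[i][j] * b_reg[i][j]

print(acc)  # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
```

Each A value visits N PEs and each B value visits N PEs, so every operand fetched at the edge is used N times: move data once, compute on it many times.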
4.4 Why Systolic Arrays Are So Efficient
Systolic arrays:
- Use short, local wires
- Eliminate caches
- Minimize control logic
- Maximize reuse

The result:
- Extremely high arithmetic intensity
- Predictable performance
- Excellent energy efficiency
This is why TPUs dominate large, dense GEMMs.
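"Extremely high arithmetic intensity" is easy to quantify for a dense GEMM. A back-of-the-envelope sketch (assumed size and fp16 storage; it counts only compulsory off-chip traffic and ignores partial-sum spills):

```python
# Arithmetic intensity of a square GEMM C = A @ B with N x N matrices.
# FLOPs: one multiply + one add per (i, j, k) triple -> 2 * N**3.
# Minimum off-chip traffic: read A and B once, write C once.

N = 4096
BYTES_PER_ELT = 2                        # fp16 (assumed)

flops = 2 * N**3
traffic = 3 * N**2 * BYTES_PER_ELT       # A + B + C, each touched once

print(flops / traffic)                   # ~1365 FLOPs per byte
```

Over a thousand operations per byte fetched is exactly the regime where a systolic array stays fully busy.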
4.5 Explicit Memory Hierarchy
Unlike GPUs, TPUs:
- Expose memory movement to the compiler
- Rely on software scheduling
- Avoid speculation

Data is:
- Loaded into on-chip SRAM
- Streamed through the compute units
- Written back in bulk
Nothing is accidental.
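As a toy model of that schedule (plain Python; "hbm" and "sram" are illustrative names, not a real TPU API), here is the load / stream / write-back loop with every transfer decided up front:

```python
# Software-managed tiling in the style of an explicit memory hierarchy.
# A big list stands in for off-chip HBM; a small working buffer stands in
# for on-chip SRAM. The schedule, not a cache, decides every transfer.

TILE = 4
hbm_in = list(range(16))                 # stand-in for off-chip memory
hbm_out = [0] * len(hbm_in)

for base in range(0, len(hbm_in), TILE):
    sram = hbm_in[base:base + TILE]      # bulk load: HBM -> on-chip SRAM
    sram = [x * 2 for x in sram]         # stream the tile through compute
    hbm_out[base:base + TILE] = sram     # bulk write-back: SRAM -> HBM

print(hbm_out)                           # [0, 2, 4, ..., 30]
```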
4.6 Control vs Flexibility
TPUs trade away:
- Generality ❌
- Irregular control flow ❌

In exchange for:
- Throughput ✅
- Efficiency ✅
- Determinism ✅

This is why:
- TPUs thrive in production
- GPUs dominate experimentation
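One way to see the trade is a plain-Python caricature (not TPU code): dynamic control runs only the branch the data selects, while a statically scheduled machine, in the spirit of predication, computes both sides on a fixed schedule and selects the result afterwards. The work becomes perfectly predictable, and some of it is wasted.

```python
def path_a(x):
    return x * 2

def path_b(x):
    return x - 1

# Dynamic control (CPU/GPU style): which work runs depends on the data.
def dynamic(x):
    return path_a(x) if x > 0 else path_b(x)

# Static control (TPU style, caricature): both paths always run on a
# fixed schedule; a cheap select picks the answer afterwards.
def static(x):
    a, b = path_a(x), path_b(x)          # predictable, branch-free work
    return a if x > 0 else b

print(dynamic(3), static(3))             # 6 6
print(dynamic(-3), static(-3))           # -4 -4
```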
4.7 TPUs on the Roofline
On the Roofline graph, TPUs don't just raise the roof; they reshape the workload.

By enforcing reuse:
- Workloads are pushed right (toward higher arithmetic intensity)
- Memory-bound kernels become compute-bound
The hardware doesn’t wait for software to behave.
It forces it.
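The Roofline itself is one formula: attainable throughput is the smaller of the compute roof and arithmetic intensity times memory bandwidth. A numeric sketch with assumed, round numbers (not real TPU specs):

```python
# Roofline: attainable FLOP/s = min(compute roof, AI * memory bandwidth).
PEAK = 100e12        # 100 TFLOP/s compute roof (assumed)
BW = 1e12            # 1 TB/s memory bandwidth (assumed)

def attainable(ai):  # ai = arithmetic intensity in FLOPs/byte
    return min(PEAK, ai * BW)

print(PEAK / BW)         # ridge point: 100 FLOPs per byte
print(attainable(10))    # 1e13 -> memory-bound (left of the ridge)
print(attainable(1365))  # 1e14 -> compute-bound, like the GEMM from 4.4
```

Crossing the ridge point is what "pushed right" means: enforced reuse raises a kernel's intensity until the compute roof, not the memory roof, is the limit.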
4.8 When TPUs Are the Right Tool
TPUs excel when:
- Models are stable
- Workloads are dense
- Performance per watt matters
- Scale is enormous

This is why TPUs power:
- Search
- Ads
- Large-scale training
- Cloud inference
Chapter 4 Takeaway
If you remember one thing:
TPUs don’t tolerate inefficiency — they design it out.
For Readers Who Want to Go Deeper 🔍
🟢 Conceptual
- Jouppi et al. — In-Datacenter Performance Analysis of a Tensor Processing Unit (ISCA 2017)

🟡 Architecture-Level
- Google TPU system architecture docs
- Eyeriss dataflow taxonomy paper (Chen et al.)

🔴 Hardware-Level
- Systolic array physical design (ISSCC)
- SRAM-centric accelerator papers
GPU vs TPU — One-Page Truth Table
| Dimension | GPU | TPU |
|---|---|---|
| Philosophy | Hide latency | Avoid latency |
| Parallelism | Massive threading | Spatial dataflow |
| Control | Dynamic | Static |
| Memory | Cache-heavy | SRAM-managed |
| Flexibility | High | Low |
| Efficiency | Good | Exceptional |
| Best for | Research, mixed workloads | Production, dense ML |