Chapter 4
TPUs: Dataflow Machines
The Promise of This Chapter
After this chapter, readers will understand:
- Why TPUs look rigid compared to GPUs
- What dataflow actually means
- Why systolic arrays exist
- When TPUs are unbeatable — and when they fail
This chapter completes the GPU ↔ TPU contrast.
4.1 The Core TPU Assumption
TPUs are built around a different assumption:
Memory latency must be avoided, not hidden.
Instead of reacting to memory stalls, TPUs are designed so stalls rarely occur.
This single assumption changes everything.
4.2 What “Dataflow” Means (Plain English)
In a CPU or GPU:
- Instructions pull data when they need it

In a TPU:
- Data is scheduled to flow through compute units
- Compute waits for data by design
- Movement is explicit, predictable, and repeated

Think less "threads."
Think more "factory assembly line."
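To make the contrast concrete, here is a toy assembly line, a minimal sketch in plain Python rather than anything TPU-specific: data is pushed through a fixed chain of stages, and each stage computes only when its input arrives from upstream.

```python
# A toy "assembly line" in plain Python (an illustration, not TPU code).
# No stage reaches out to fetch data; each one computes exactly when its
# input arrives from the stage upstream.

def source():
    yield from range(5)                  # raw parts enter the line

def scale(upstream):
    for x in upstream:                   # fires only when data flows in
        yield x * 10

def offset(upstream):
    for x in upstream:
        yield x + 1

line = offset(scale(source()))           # wire the stages together
print(list(line))                        # [1, 11, 21, 31, 41]
```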
4.3 The Systolic Array Mental Model
A systolic array is a grid of simple MAC (multiply-accumulate) units:
- Data pulses rhythmically through the grid
- Each value is reused many times as it flows

Key idea:
Move data once. Compute on it many times.

This is the physical embodiment of "compute is cheap, data is expensive."
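The rhythm is easiest to see in a toy simulation. The sketch below (plain Python, illustrative only; real arrays are fixed-function hardware, and feeding and skew details vary by design) models an output-stationary systolic array computing C = A × B: rows of A stream in from the left, columns of B from the top, and each PE performs one multiply-accumulate per cycle on whatever flows past it.

```python
# Toy output-stationary systolic array computing C = A @ B (N x N).
# A streams in from the left edge, B from the top edge, each skewed by
# one cycle per row/column so matching operands meet at the right PE.

N = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

acc = [[0] * N for _ in range(N)]    # one accumulator per PE (stationary C)
a_reg = [[0] * N for _ in range(N)]  # A value each PE holds this cycle
b_reg = [[0] * N for _ in range(N)]  # B value each PE holds this cycle

for t in range(3 * N - 2):           # enough cycles to drain the pipeline
    # Shift A one PE to the right (reverse order = simultaneous shift).
    for i in range(N):
        for j in reversed(range(N)):
            a_reg[i][j] = a_reg[i][j - 1] if j > 0 else (
                A[i][t - i] if 0 <= t - i < N else 0)
    # Shift B one PE down.
    for j in range(N):
        for i in reversed(range(N)):
            b_reg[i][j] = b_reg[i - 1][j] if i > 0 else (
                B[t - j][j] if 0 <= t - j < N else 0)
    # Every PE does one MAC per cycle on the values flowing through it.
    for i in range(N):
        for j in range(N):
            acc[i][j] += a_reg[i][j] * b_reg[i][j]

print(acc)  # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
```

Each A value visits N PEs and each B value visits N PEs, so every operand fetched at the edge is used N times: move data once, compute on it many times.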
4.4 Why Systolic Arrays Are So Efficient
Systolic arrays:
- Use short, local wires
- Eliminate caches
- Minimize control logic
- Maximize reuse

The result:
- Extremely high arithmetic intensity
- Predictable performance
- Excellent energy efficiency
This is why TPUs dominate large, dense GEMMs.
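"Extremely high arithmetic intensity" is easy to quantify for a dense GEMM. A back-of-the-envelope sketch (assumed size and fp16 storage; it counts only compulsory off-chip traffic and ignores partial-sum spills):

```python
# Arithmetic intensity of a square GEMM C = A @ B with N x N matrices.
# FLOPs: one multiply + one add per (i, j, k) triple -> 2 * N**3.
# Minimum off-chip traffic: read A and B once, write C once.

N = 4096
BYTES_PER_ELT = 2                        # fp16 (assumed)

flops = 2 * N**3
traffic = 3 * N**2 * BYTES_PER_ELT       # A + B + C, each touched once

print(flops / traffic)                   # ~1365 FLOPs per byte
```

Over a thousand operations per byte fetched is exactly the regime where a systolic array stays fully busy.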
4.5 Explicit Memory Hierarchy
Unlike GPUs, TPUs:
- Expose memory movement to the compiler
- Rely on software scheduling
- Avoid speculation

Data is:
- Loaded into on-chip SRAM
- Streamed through the compute units
- Written back in bulk
Nothing is accidental.
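As a toy model of that schedule (plain Python; "hbm" and "sram" are illustrative names, not a real TPU API), here is the load / stream / write-back loop with every transfer decided up front:

```python
# Software-managed tiling in the style of an explicit memory hierarchy.
# A big list stands in for off-chip HBM; a small working buffer stands in
# for on-chip SRAM. The schedule, not a cache, decides every transfer.

TILE = 4
hbm_in = list(range(16))                 # stand-in for off-chip memory
hbm_out = [0] * len(hbm_in)

for base in range(0, len(hbm_in), TILE):
    sram = hbm_in[base:base + TILE]      # bulk load: HBM -> on-chip SRAM
    sram = [x * 2 for x in sram]         # stream the tile through compute
    hbm_out[base:base + TILE] = sram     # bulk write-back: SRAM -> HBM

print(hbm_out)                           # [0, 2, 4, ..., 30]
```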
4.6 Control vs Flexibility
TPUs trade away:
- Generality ❌
- Irregular control flow ❌

In exchange for:
- Throughput ✅
- Efficiency ✅
- Determinism ✅

This is why:
- TPUs thrive in production
- GPUs dominate experimentation
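One way to see the trade is a plain-Python caricature (not TPU code): dynamic control runs only the branch the data selects, while a statically scheduled machine, in the spirit of predication, computes both sides on a fixed schedule and selects the result afterwards. The work becomes perfectly predictable, and some of it is wasted.

```python
def path_a(x):
    return x * 2

def path_b(x):
    return x - 1

# Dynamic control (CPU/GPU style): which work runs depends on the data.
def dynamic(x):
    return path_a(x) if x > 0 else path_b(x)

# Static control (TPU style, caricature): both paths always run on a
# fixed schedule; a cheap select picks the answer afterwards.
def static(x):
    a, b = path_a(x), path_b(x)          # predictable, branch-free work
    return a if x > 0 else b

print(dynamic(3), static(3))             # 6 6
print(dynamic(-3), static(-3))           # -4 -4
```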
4.7 TPUs on the Roofline
On the Roofline graph, TPUs don't just raise the roof; they reshape the workload.

By enforcing reuse:
- Workloads are pushed right (toward higher arithmetic intensity)
- Memory-bound kernels become compute-bound
The hardware doesn’t wait for software to behave.
It forces it.
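The Roofline itself is one formula: attainable throughput is the smaller of the compute roof and arithmetic intensity times memory bandwidth. A numeric sketch with assumed, round numbers (not real TPU specs):

```python
# Roofline: attainable FLOP/s = min(compute roof, AI * memory bandwidth).
PEAK = 100e12        # 100 TFLOP/s compute roof (assumed)
BW = 1e12            # 1 TB/s memory bandwidth (assumed)

def attainable(ai):  # ai = arithmetic intensity in FLOPs/byte
    return min(PEAK, ai * BW)

print(PEAK / BW)         # ridge point: 100 FLOPs per byte
print(attainable(10))    # 1e13 -> memory-bound (left of the ridge)
print(attainable(1365))  # 1e14 -> compute-bound, like the GEMM from 4.4
```

Crossing the ridge point is what "pushed right" means: enforced reuse raises a kernel's intensity until the compute roof, not the memory roof, is the limit.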
4.8 When TPUs Are the Right Tool
TPUs excel when:
- Models are stable
- Workloads are dense
- Performance per watt matters
- Scale is enormous

This is why TPUs power:
- Search
- Ads
- Large-scale training
- Cloud inference
Chapter 4 Takeaway
If you remember one thing:
TPUs don’t tolerate inefficiency — they design it out.
For Readers Who Want to Go Deeper 🔍
🟢 Conceptual
- Jouppi et al. — In-Datacenter Performance Analysis of a Tensor Processing Unit (ISCA 2017)

🟡 Architecture-Level
- Google TPU system architecture docs
- Eyeriss dataflow taxonomy paper (Chen et al.)

🔴 Hardware-Level
- Systolic array physical design (ISSCC)
- SRAM-centric accelerator papers
GPU vs TPU — One-Page Truth Table
| Dimension | GPU | TPU |
|---|---|---|
| Philosophy | Hide latency | Avoid latency |
| Parallelism | Massive threading | Spatial dataflow |
| Control | Dynamic | Static |
| Memory | Cache-heavy | SRAM-managed |
| Flexibility | High | Low |
| Efficiency | Good | Exceptional |
| Best for | Research, mixed workloads | Production, dense ML |