
Chapter 4

TPUs: Dataflow Machines

Where GPUs assume chaos, TPUs assume control — and the contrast finally becomes sharp.

The Promise of This Chapter

After this chapter, readers will understand:

  • Why TPUs look rigid compared to GPUs

  • What dataflow actually means

  • Why systolic arrays exist

  • When TPUs are unbeatable — and when they fail

This chapter completes the GPU ↔ TPU contrast.


4.1 The Core TPU Assumption

TPUs are built around a different assumption:

Memory latency must be avoided, not hidden.

Instead of reacting to memory stalls,
TPUs are designed so stalls rarely occur.

This single assumption changes everything.


4.2 What “Dataflow” Means (Plain English)

In a CPU or GPU:

  • Instructions pull data when they need it

In a TPU:

  • Data is scheduled to flow through compute units

  • Compute waits for data by design

  • Movement is explicit, predictable, and repeated

Think less “threads.”
Think more “factory assembly line.”
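
A toy sketch of the assembly-line idea in plain Python (the stage functions are illustrative placeholders, not TPU APIs): every item visits every station in a fixed order, on a fixed schedule, and no station ever issues its own load.

def assembly_line(stages, stream):
    # Factory model: data is pushed through a fixed sequence of
    # stages; no stage ever asks "where is my data?"
    for item in stream:           # items arrive at a steady rhythm
        for stage in stages:      # every item visits every station
            item = stage(item)
        yield item

# Three fixed stations, applied in order to a stream of inputs.
line = assembly_line([lambda x: x * 2,
                      lambda x: x + 1,
                      lambda x: x * x],
                     range(4))
print(list(line))   # [1, 9, 25, 49]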


4.3 The Systolic Array Mental Model

In a systolic array:

  • Compute is a grid of simple MAC (multiply-accumulate) units

  • Data pulses rhythmically through the grid

  • Each value is reused many times as it flows

Key idea:

Move data once. Compute on it many times.

This is the physical embodiment of
“compute is cheap, data is expensive.”
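
Here is a minimal, not cycle-accurate sketch of the weight-stationary flavor in Python/NumPy. The weight matrix is parked in the grid exactly once, and every streamed input row reuses all of it; the name systolic_matmul is invented for illustration.

import numpy as np

def systolic_matmul(A, B):
    # Weight-stationary sketch: B is loaded into the grid exactly once;
    # rows of A then stream through, so every parked weight is reused
    # for every row of A. The clock-by-clock "pulse" is abstracted away;
    # only the data-reuse pattern is modeled.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    grid = B.copy()                    # one weight parked per MAC cell
    C = np.zeros((M, N))
    for i in range(M):                 # each input row flows through the grid
        acc = np.zeros(N)              # partial sums travel down the columns
        for k in range(K):             # one grid row per pulse
            acc += A[i, k] * grid[k]   # every cell does a multiply-accumulate
        C[i] = acc
    return C

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)   # B moved once, reused M times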


4.4 Why Systolic Arrays Are So Efficient

Systolic arrays:

  • Use short, local wires

  • Replace caches with software-managed SRAM

  • Minimize control logic

  • Maximize reuse

The result:

  • Extremely high arithmetic intensity

  • Predictable performance

  • Excellent energy efficiency

This is why TPUs dominate large, dense GEMMs (general matrix multiplies).
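
You can put a number on that arithmetic intensity. If every operand of C = A × B crosses the chip boundary exactly once (the systolic ideal), intensity grows linearly with matrix size. The sketch below is back-of-the-envelope, assuming bf16 operands:

def gemm_intensity(M, N, K, bytes_per_elem=2):
    # Ideal FLOPs-per-byte for C = A @ B if A, B, and C each cross
    # the chip boundary exactly once. bytes_per_elem=2 assumes bf16.
    flops = 2 * M * N * K                              # one mul + one add per MAC
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)
    return flops / bytes_moved

print(gemm_intensity(1024, 1024, 1024))   # ~341 FLOPs/byte: deeply compute-bound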


4.5 Explicit Memory Hierarchy

Unlike GPUs, TPUs:

  • Expose memory movement to the compiler

  • Rely on software scheduling

  • Avoid speculation

Memory is:

  • Loaded into on-chip SRAM

  • Streamed through compute

  • Written back in bulk

Nothing is accidental.
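
A schematic of that schedule in Python/NumPy. DMA engines, double buffering, and real tile shapes are all abstracted away, and stream_through with its arguments is invented for illustration:

import numpy as np

def stream_through(hbm_data, tile_rows, kernel):
    # Compiler-planned schedule: bulk-load a tile into "SRAM",
    # stream it through compute, write results back in bulk.
    # Every transfer is decided ahead of time; nothing is speculative.
    results = []
    for start in range(0, len(hbm_data), tile_rows):
        sram_tile = hbm_data[start:start + tile_rows]  # bulk load into on-chip SRAM
        results.append(kernel(sram_tile))              # stream through compute units
    return np.concatenate(results)                     # bulk write-back to HBM

x = np.arange(16, dtype=float)
print(stream_through(x, tile_rows=4, kernel=lambda t: t * 2.0))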


4.6 Control vs Flexibility

TPUs trade:

  • Generality ❌

  • Irregular control flow ❌

For:

  • Throughput ✅

  • Efficiency ✅

  • Determinism ✅

This is why:

  • TPUs thrive in production

  • GPUs dominate experimentation


4.7 TPUs on the Roofline

On the Roofline graph:

  • TPUs don’t just raise the roof

  • They reshape the workload

By enforcing reuse:

  • Workloads are pushed to the right (higher arithmetic intensity)

  • Memory-bound kernels become compute-bound

The hardware doesn’t wait for software to behave.
It forces it.
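
The standard Roofline formula makes the effect concrete: attainable performance is min(peak compute, bandwidth × arithmetic intensity). The peak and bandwidth figures below are illustrative placeholders, not any particular TPU's specs:

def attainable_tflops(intensity, peak_tflops=100.0, bw_tb_per_s=1.0):
    # Roofline model: you are capped by whichever roof you hit first,
    # the flat compute roof or the sloped bandwidth roof.
    return min(peak_tflops, bw_tb_per_s * intensity)

ridge = 100.0 / 1.0   # intensity where the two roofs meet (FLOPs/byte)
for ai in (10, 50, 100, 341):
    bound = "compute" if ai >= ridge else "memory"
    print(f"{ai:>4} FLOPs/byte -> {attainable_tflops(ai):6.1f} TFLOP/s ({bound}-bound)")

# Forcing reuse moves a kernel's intensity rightward past the
# ridge point, where the compute roof takes over.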


4.8 When TPUs Are the Right Tool

TPUs excel when:

  • Models are stable

  • Workloads are dense

  • Performance per watt matters

  • Scale is enormous

This is why TPUs power:

  • Search

  • Ads

  • Large-scale training

  • Cloud inference


Chapter 4 Takeaway

If you remember one thing:

TPUs don’t tolerate inefficiency — they design it out.


For Readers Who Want to Go Deeper 🔍

🟢 Conceptual

  • Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit” (ISCA 2017)

🟡 Architecture-Level

  • Google TPU system architecture docs

  • Eyeriss dataflow taxonomy paper

🔴 Hardware-Level

  • Systolic array physical design (ISSCC)

  • SRAM-centric accelerator papers


GPU vs TPU — One-Page Truth Table

Dimension      GPU                         TPU
Philosophy     Hide latency                Avoid latency
Parallelism    Massive threading           Spatial dataflow
Control        Dynamic                     Static
Memory         Cache-heavy                 SRAM-managed
Flexibility    High                        Low
Efficiency     Good                        Exceptional
Best for       Research, mixed workloads   Production, dense ML
