Chapter 0
Why Deep Learning Needs Special Hardware
The One-Sentence Problem
Modern AI is limited not by how fast we can compute, but by how fast we can move data.
Everything else in this primer follows from that.
0.1 Why CPUs Were Enough — Until They Weren’t
For decades, computers were designed for:
- Branch-heavy code
- Small working datasets
- Sequential execution
CPUs are amazing at:
- Running operating systems
- Handling unpredictable control flow
- Doing a little bit of everything
But deep learning is the opposite.
0.2 What Deep Learning Actually Does
Under the hood, training and inference mostly repeat one operation:
Multiply large matrices and add the results
Whether it’s:
- Image recognition
- Speech
- Translation
- Chatbots
They all reduce to dense linear algebra.
There are very few branches.
There is enormous repetition.
The same data is reused again and again.
This is the key mismatch with CPUs.
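To make that concrete, here is a minimal NumPy sketch of a single dense layer. The shapes and names (batch, d_in, d_out) are illustrative, not taken from any particular model; the point is that the whole layer is one matrix multiply plus an add, with no data-dependent branching.

```python
import numpy as np

# Illustrative shapes: a batch of 64 inputs, a layer mapping 1024 -> 4096 features.
batch, d_in, d_out = 64, 1024, 4096

x = np.random.randn(batch, d_in).astype(np.float32)   # activations
W = np.random.randn(d_in, d_out).astype(np.float32)   # weights
b = np.zeros(d_out, dtype=np.float32)                  # bias

# The core of almost every layer: multiply large matrices and add the results.
y = x @ W + b

# Roughly 2 * batch * d_in * d_out floating-point operations,
# all of them multiply-adds, with no branching.
flops = 2 * batch * d_in * d_out
print(f"{flops / 1e6:.1f} MFLOPs for one layer, one batch")
```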
0.3 Compute Is Cheap. Data Movement Is Not.
A useful mental rule:
Moving data costs 10–100× more energy than computing on it.
- A multiply-add is cheap
- Fetching data from memory is expensive
- Fetching data from far-away memory is very expensive
As models grow:
- Parameters no longer fit in caches
- Memory bandwidth becomes the bottleneck
- Adding more ALUs stops helping
This is why “just faster CPUs” failed.
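A quick back-of-the-envelope calculation makes the rule tangible. The energy figures below are assumed, round numbers chosen only to land inside the 10–100× range quoted above; real values depend on the process node and the memory technology.

```python
# Back-of-the-envelope energy comparison (assumed, illustrative figures).
E_MAC_PJ = 1.0          # assumed energy of one on-chip multiply-add, in picojoules
E_DRAM_BYTE_PJ = 25.0   # assumed energy to fetch one byte from off-chip DRAM

# One float32 operand fetched from DRAM and used in a single multiply-add:
compute_energy = E_MAC_PJ
movement_energy = 4 * E_DRAM_BYTE_PJ   # one float32 = 4 bytes

print(f"compute:  {compute_energy:.0f} pJ")
print(f"movement: {movement_energy:.0f} pJ")
print(f"movement / compute = {movement_energy / compute_energy:.0f}x")
```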
0.4 The Hidden Enemy: Memory Bandwidth
Imagine a factory:
- Machines work extremely fast
- But parts arrive slowly on a conveyor belt
Adding more machines doesn’t help.
The belt is the problem.
In hardware terms:
- The machines are compute units
- The belt is memory bandwidth
Deep learning accelerators exist to fix this imbalance.
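The factory picture can be written down in two lines: if compute and memory traffic overlap perfectly, a kernel takes as long as whichever resource finishes last. The peak numbers below are assumed, round figures for a hypothetical accelerator, not the specifications of any real chip.

```python
# A toy version of the "machines vs. conveyor belt" picture.
# Assumed, round peak numbers for a hypothetical chip.
PEAK_FLOPS = 100e12      # 100 TFLOP/s of compute ("the machines")
PEAK_BW    = 1e12        # 1 TB/s of memory bandwidth ("the belt")

def kernel_time(flops, bytes_moved):
    """Runtime is set by whichever resource finishes last (assumes perfect overlap)."""
    compute_time = flops / PEAK_FLOPS
    memory_time  = bytes_moved / PEAK_BW
    return max(compute_time, memory_time), compute_time, memory_time

# A bandwidth-bound example: elementwise add of two big float32 vectors
# (read two vectors, write one, one FLOP per element).
n = 1_000_000_000
t, tc, tm = kernel_time(flops=n, bytes_moved=3 * 4 * n)
print(f"machines need {tc*1e3:.2f} ms, the belt needs {tm*1e3:.1f} ms -> {t*1e3:.1f} ms total")
```

Adding more machines (a higher PEAK_FLOPS) changes nothing here; only a faster belt does.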
0.5 The Three Rules That Shape All AI Hardware
Every GPU, TPU, and accelerator follows these rules:
- Maximize data reuse
  - Use the same numbers many times before fetching new ones
- Move data as little as possible
  - Prefer on-chip memory over off-chip memory
- Trade flexibility for throughput
  - Do fewer things, but do them extremely fast
Once you see these rules, GPU and TPU designs stop looking mysterious.
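Rule 1 is easy to quantify for a matrix multiply. In C = A @ B with A of shape (M, K) and B of shape (K, N), each element of A is reused N times and each element of B is reused M times, so the arithmetic performed per byte loaded grows with the matrix sizes. A short sketch with illustrative sizes:

```python
# How much work does a matmul do per byte it has to load?
# C = A @ B, with A (M x K) and B (K x N), all float32. Illustrative sizes.
M = N = K = 4096
bytes_per_elem = 4

flops       = 2 * M * N * K                        # one multiply + one add per (i, j, k)
bytes_moved = bytes_per_elem * (M*K + K*N + M*N)   # load A and B once, store C once (ideal reuse)

print(f"arithmetic intensity: {flops / bytes_moved:.0f} FLOPs per byte")
# Each element of A is reused N times and each element of B is reused M times.
# Exploiting that reuse on-chip is exactly what rules 1 and 2 demand.
```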
0.6 A Preview of What’s Coming
In the next chapters, we will show:
- Why matrix multiplication dominates everything
- How the Roofline model explains performance limits
- Why GPUs use thousands of threads
- Why TPUs use systolic arrays
- Why scaling models is harder than building fast chips
And most importantly:
GPUs and TPUs are not competitors —
they are different answers to the same constraints.
Chapter 0 Takeaway
If you remember only one thing:
Deep learning hardware exists to reduce data movement, not to increase raw compute.
Everything else is an implementation detail.
Next Chapter
Chapter 1: The Roofline Model — The One Graph That Explains All Accelerators