Chapter 0

Why Deep Learning Needs Special Hardware

The One-Sentence Problem

Modern AI is limited not by how fast we can compute, but by how fast we can move data.

Everything else in this primer follows from that.


0.1 Why CPUs Were Enough — Until They Weren’t

For decades, computers were designed for:

  • Branch-heavy code

  • Small working datasets

  • Sequential execution

CPUs are amazing at:

  • Running operating systems

  • Handling unpredictable control flow

  • Doing a little bit of everything

But deep learning is the opposite.


0.2 What Deep Learning Actually Does

Under the hood, training and inference mostly repeat one operation:

Multiply large matrices and add the results

It doesn’t matter whether the task is:

  • Image recognition

  • Speech

  • Translation

  • Chatbots

They all reduce to dense linear algebra.

There are very few branches.
There is enormous repetition.
The same data is reused again and again.

This is the key mismatch with CPUs.
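
As a concrete (and deliberately tiny) illustration, here is a minimal NumPy sketch of a single fully connected layer. The shapes are made-up assumptions; real layers differ mainly in scale.

```python
import numpy as np

# Hypothetical layer sizes, chosen only for illustration.
batch, d_in, d_out = 64, 1024, 4096

x = np.random.randn(batch, d_in).astype(np.float32)   # input activations
W = np.random.randn(d_in, d_out).astype(np.float32)   # weights
b = np.zeros(d_out, dtype=np.float32)                  # bias

# "Multiply large matrices and add the results":
# one matrix multiply plus an elementwise add does almost all the work.
y = x @ W + b

# Note what is absent: no data-dependent branching.
# Note the reuse: the same W is applied to every row of x.
print(y.shape)  # (64, 4096)
```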


0.3 Compute Is Cheap. Data Movement Is Not.

A useful mental rule:

Moving data costs 10–100× more energy than computing on it.

  • A multiply-add is cheap

  • Fetching data from memory is expensive

  • Fetching data from far-away memory is very expensive

As models grow:

  • Parameters no longer fit in caches

  • Memory bandwidth becomes the bottleneck

  • Adding more ALUs stops helping

This is why “just build faster CPUs” stopped being enough.
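
To make the 10–100× rule concrete, here is a back-of-the-envelope count of arithmetic versus bytes for the layer sketched above. All numbers are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope: FLOPs vs. bytes for the layer above (illustrative only).
batch, d_in, d_out = 64, 1024, 4096
bytes_per_elem = 4  # float32

# Each output element needs d_in multiply-adds -> 2 ops per term.
flops = 2 * batch * d_in * d_out

# Best case: every operand crosses the memory boundary exactly once.
bytes_moved = bytes_per_elem * (batch * d_in      # read x
                                + d_in * d_out    # read W
                                + batch * d_out)  # write y

print(f"FLOPs:          {flops:,}")                   # 536,870,912
print(f"bytes (ideal):  {bytes_moved:,}")             # 18,087,936
print(f"FLOPs per byte: {flops / bytes_moved:.1f}")   # ~29.7

# If W no longer fits on-chip and must be re-fetched for every batch row,
# bytes_moved explodes and the FLOPs-per-byte ratio collapses:
# that is the regime where bandwidth, not arithmetic, sets the speed limit.
```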


0.4 The Hidden Enemy: Memory Bandwidth

Imagine a factory:

  • Machines work extremely fast

  • But parts arrive slowly on a conveyor belt

Adding more machines doesn’t help.
The belt is the problem.

In hardware terms:

  • The machines are compute units

  • The belt is memory bandwidth

Deep learning accelerators exist to fix this imbalance.
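
The imbalance can be put in numbers. The sketch below uses made-up hardware figures (100 TFLOP/s of compute, 1 TB/s of bandwidth) to show how busy a given workload can keep the “machines”; it is a rough preview of the Roofline model covered in Chapter 1.

```python
# "Belt vs. machines" with made-up hardware numbers (not any real chip).
PEAK_COMPUTE = 100e12     # assumed peak compute: 100 TFLOP/s (the machines)
PEAK_BANDWIDTH = 1e12     # assumed memory bandwidth: 1 TB/s (the belt)

def attainable(flops_per_byte):
    """Performance cap for a kernel doing `flops_per_byte` work per byte fetched."""
    return min(PEAK_COMPUTE, PEAK_BANDWIDTH * flops_per_byte)

for intensity in (1, 10, 100, 1000):
    frac = attainable(intensity) / PEAK_COMPUTE
    print(f"{intensity:>4} FLOPs/byte -> {frac:6.1%} of peak compute")

# 1 FLOP/byte keeps only 1% of the machines busy: the belt is the problem.
# Past ~100 FLOPs/byte (for these assumed numbers) the belt stops mattering.
```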


0.5 The Three Rules That Shape All AI Hardware

Every GPU, TPU, and accelerator follows these rules:

  1. Maximize data reuse

    • Use the same numbers many times before fetching new ones

  2. Move data as little as possible

    • Prefer on-chip memory over off-chip memory

  3. Trade flexibility for throughput

    • Do fewer things, but do them extremely fast

Once you see these rules, GPU and TPU designs stop looking mysterious.
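
Rules 1 and 2 are easiest to see in a tiled matrix multiply. The sketch below only counts slow-memory traffic; the matrix and tile sizes are arbitrary assumptions, chosen so that a tile plausibly fits in fast on-chip memory.

```python
# Counting slow-memory traffic for C = A @ B, with and without data reuse.
# Sizes are assumptions for illustration; float32 throughout.
M = N = K = 4096
TILE = 128          # assume a TILE x TILE block fits in on-chip memory
ELEM = 4            # bytes per float32

# No reuse: every multiply-add fetches its A and B operands from slow memory.
naive_bytes = 2 * M * N * K * ELEM

# Tiled: each element of A is fetched once per column-tile of C (N / TILE times),
# each element of B once per row-tile of C (M / TILE times), then reused on-chip.
tiled_bytes = (M * K * (N // TILE) + K * N * (M // TILE)) * ELEM

print(f"naive traffic: {naive_bytes / 1e9:7.1f} GB")   # ~549.8 GB
print(f"tiled traffic: {tiled_bytes / 1e9:7.1f} GB")   # ~4.3 GB
print(f"reduction:     {naive_bytes / tiled_bytes:.0f}x")  # = TILE

# Same arithmetic, ~TILE times less data movement: rules 1 and 2 in action,
# and the larger the fast memory, the bigger the win.
```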


0.6 A Preview of What’s Coming

In the next chapters, we will show:

  • Why matrix multiplication dominates everything

  • How the Roofline model explains performance limits

  • Why GPUs use thousands of threads

  • Why TPUs use systolic arrays

  • Why scaling models is harder than building fast chips

And most importantly:

GPUs and TPUs are not competitors —
they are different answers to the same constraints.


Chapter 0 Takeaway

If you remember only one thing:

Deep learning hardware exists to reduce data movement, not to increase raw compute.

Everything else is an implementation detail.


Next Chapter

Chapter 1: The Roofline Model — The One Graph That Explains All Accelerators

