Chapter 0

Why Deep Learning Needs Special Hardware

The One-Sentence Problem

Modern AI is limited not by how fast we can compute, but by how fast we can move data.

Everything else in this primer follows from that.


0.1 Why CPUs Were Enough — Until They Weren’t

For decades, computers were designed for:

  • Branch-heavy code

  • Small working datasets

  • Sequential execution

CPUs are amazing at:

  • Running operating systems

  • Handling unpredictable control flow

  • Doing a little bit of everything

But deep learning is the opposite.


0.2 What Deep Learning Actually Does

Under the hood, training and inference mostly repeat one operation:

Multiply large matrices and add the results

It doesn’t matter whether the task is:

  • Image recognition

  • Speech

  • Translation

  • Chatbots

They all reduce to dense linear algebra.

There are very few branches.
There is enormous repetition.
The same data is reused again and again.

This is the key mismatch with CPUs.
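
As a concrete (and deliberately tiny) illustration, here is a minimal NumPy sketch of a single fully connected layer. The shapes are made-up assumptions; real layers differ mainly in scale.

```python
import numpy as np

# Hypothetical layer sizes, chosen only for illustration.
batch, d_in, d_out = 64, 1024, 4096

x = np.random.randn(batch, d_in).astype(np.float32)   # input activations
W = np.random.randn(d_in, d_out).astype(np.float32)   # weights
b = np.zeros(d_out, dtype=np.float32)                  # bias

# "Multiply large matrices and add the results":
# one matrix multiply plus an elementwise add does almost all the work.
y = x @ W + b

# Note what is absent: no data-dependent branching.
# Note the reuse: the same W is applied to every row of x.
print(y.shape)  # (64, 4096)
```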


0.3 Compute Is Cheap. Data Movement Is Not.

A useful mental rule:

Moving data costs 10–100× more energy than computing on it.

  • A multiply-add is cheap

  • Fetching data from memory is expensive

  • Fetching data from far-away memory is very expensive

As models grow:

  • Parameters no longer fit in caches

  • Memory bandwidth becomes the bottleneck

  • Adding more ALUs stops helping

This is why “just build faster CPUs” stopped being enough.
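
To make the 10–100× rule concrete, here is a back-of-the-envelope count of arithmetic versus bytes for the layer sketched above. All numbers are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope: FLOPs vs. bytes for the layer above (illustrative only).
batch, d_in, d_out = 64, 1024, 4096
bytes_per_elem = 4  # float32

# Each output element needs d_in multiply-adds -> 2 ops per term.
flops = 2 * batch * d_in * d_out

# Best case: every operand crosses the memory boundary exactly once.
bytes_moved = bytes_per_elem * (batch * d_in      # read x
                                + d_in * d_out    # read W
                                + batch * d_out)  # write y

print(f"FLOPs:          {flops:,}")                   # 536,870,912
print(f"bytes (ideal):  {bytes_moved:,}")             # 18,087,936
print(f"FLOPs per byte: {flops / bytes_moved:.1f}")   # ~29.7

# If W no longer fits on-chip and must be re-fetched for every batch row,
# bytes_moved explodes and the FLOPs-per-byte ratio collapses:
# that is the regime where bandwidth, not arithmetic, sets the speed limit.
```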


0.4 The Hidden Enemy: Memory Bandwidth

Imagine a factory:

  • Machines work extremely fast

  • But parts arrive slowly on a conveyor belt

Adding more machines doesn’t help.
The belt is the problem.

In hardware terms:

  • The machines are compute units

  • The belt is memory bandwidth

Deep learning accelerators exist to fix this imbalance.
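
The imbalance can be put in numbers. The sketch below uses made-up hardware figures (100 TFLOP/s of compute, 1 TB/s of bandwidth) to show how busy a given workload can keep the “machines”; it is a rough preview of the Roofline model covered in Chapter 1.

```python
# "Belt vs. machines" with made-up hardware numbers (not any real chip).
PEAK_COMPUTE = 100e12     # assumed peak compute: 100 TFLOP/s (the machines)
PEAK_BANDWIDTH = 1e12     # assumed memory bandwidth: 1 TB/s (the belt)

def attainable(flops_per_byte):
    """Performance cap for a kernel doing `flops_per_byte` work per byte fetched."""
    return min(PEAK_COMPUTE, PEAK_BANDWIDTH * flops_per_byte)

for intensity in (1, 10, 100, 1000):
    frac = attainable(intensity) / PEAK_COMPUTE
    print(f"{intensity:>4} FLOPs/byte -> {frac:6.1%} of peak compute")

# 1 FLOP/byte keeps only 1% of the machines busy: the belt is the problem.
# Past ~100 FLOPs/byte (for these assumed numbers) the belt stops mattering.
```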


0.5 The Three Rules That Shape All AI Hardware

Every GPU, TPU, and accelerator follows these rules:

  1. Maximize data reuse

    • Use the same numbers many times before fetching new ones

  2. Move data as little as possible

    • Prefer on-chip memory over off-chip memory

  3. Trade flexibility for throughput

    • Do fewer things, but do them extremely fast

Once you see these rules, GPU and TPU designs stop looking mysterious.
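
Rules 1 and 2 are easiest to see in a tiled matrix multiply. The sketch below only counts slow-memory traffic; the matrix and tile sizes are arbitrary assumptions, chosen so that a tile plausibly fits in fast on-chip memory.

```python
# Counting slow-memory traffic for C = A @ B, with and without data reuse.
# Sizes are assumptions for illustration; float32 throughout.
M = N = K = 4096
TILE = 128          # assume a TILE x TILE block fits in on-chip memory
ELEM = 4            # bytes per float32

# No reuse: every multiply-add fetches its A and B operands from slow memory.
naive_bytes = 2 * M * N * K * ELEM

# Tiled: each element of A is fetched once per column-tile of C (N / TILE times),
# each element of B once per row-tile of C (M / TILE times), then reused on-chip.
tiled_bytes = (M * K * (N // TILE) + K * N * (M // TILE)) * ELEM

print(f"naive traffic: {naive_bytes / 1e9:7.1f} GB")   # ~549.8 GB
print(f"tiled traffic: {tiled_bytes / 1e9:7.1f} GB")   # ~4.3 GB
print(f"reduction:     {naive_bytes / tiled_bytes:.0f}x")  # = TILE

# Same arithmetic, ~TILE times less data movement: rules 1 and 2 in action,
# and the larger the fast memory, the bigger the win.
```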


0.6 A Preview of What’s Coming

In the next chapters, we will show:

  • Why matrix multiplication dominates everything

  • How the Roofline model explains performance limits

  • Why GPUs use thousands of threads

  • Why TPUs use systolic arrays

  • Why scaling models is harder than building fast chips

And most importantly:

GPUs and TPUs are not competitors —
they are different answers to the same constraints.


Chapter 0 Takeaway

If you remember only one thing:

Deep learning hardware exists to reduce data movement, not to increase raw compute.

Everything else is an implementation detail.


Next Chapter

Chapter 1: The Roofline Model — The One Graph That Explains All Accelerators

