Chapter 2
Why Everything in Deep Learning Becomes Matrix Multiplication
The Promise of This Chapter
After this chapter, readers will understand:
- Why CNNs, Transformers, and MLPs all look different — but run the same
- Why accelerators revolve around GEMM
- Why “matrix multiply engines” dominate AI chips
This is where software abstractions collapse into hardware reality.
2.1 The Illusion of Diversity in Neural Networks
From the outside, models look very different:
- Convolutions for images
- Attention for language
- MLPs everywhere
But hardware doesn’t see “layers” or “tokens.”
Hardware sees:
Regular, dense numerical operations over large arrays
Once performance matters, all roads lead to matrix multiplication.
2.2 Why Matrix Multiply Is the Perfect Hardware Workload
Matrix multiplication has three properties hardware loves:
- Massive data reuse
  - Each value is used many times
- Regular access patterns
  - Predictable, schedulable
- High arithmetic intensity
  - Lots of math per byte moved
On the Roofline graph, GEMM lives far to the right.
This is not a software choice.
It is a physical inevitability.
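To put rough numbers on that claim, here is a back-of-the-envelope sketch. The sizes are my own illustrative choices, assuming fp32 operands and counting only the compulsory off-chip traffic (each matrix read or written once); it compares the arithmetic intensity of a large GEMM with that of a vector add.

```python
import numpy as np

# Rough arithmetic intensity (FLOPs per byte moved), fp32, compulsory traffic only.
def gemm_intensity(M, N, K, bytes_per_elem=4):
    flops = 2 * M * N * K                                    # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)   # read A and B, write C
    return flops / bytes_moved

def vector_add_intensity(n, bytes_per_elem=4):
    flops = n                                                # one add per element
    bytes_moved = bytes_per_elem * 3 * n                     # read a and b, write c
    return flops / bytes_moved

print(f"GEMM 4096^3 : {gemm_intensity(4096, 4096, 4096):.1f} FLOP/byte")
print(f"Vector add  : {vector_add_intensity(1_000_000):.3f} FLOP/byte")
```

The GEMM lands at hundreds of FLOPs per byte; the vector add at well under one. That gap is the whole story of this chapter.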
2.3 Convolutions → Matrix Multiply
A convolution:
- Slides a small filter over an image
- Accumulates dot products
By rearranging data (e.g., im2col):
- Each convolution becomes a matrix multiply
- Filters become one matrix
- Image patches become another
This costs some extra memory — but buys:
- Better reuse
- Higher performance
- Accelerator compatibility
Hardware prefers wasted memory to wasted bandwidth.
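A minimal im2col sketch in NumPy makes the rearrangement concrete. The shapes and names are illustrative (single image, single channel, stride 1, no padding), not taken from any particular framework.

```python
import numpy as np

# im2col: unfold every filter-sized patch of the image into one row of a matrix.
def im2col(image, kh, kw):
    H, W = image.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = image[i:i + kh, j:j + kw].ravel()
    return cols, (out_h, out_w)

image = np.random.rand(32, 32)
filters = np.random.rand(8, 3, 3)              # 8 filters of size 3x3

cols, (out_h, out_w) = im2col(image, 3, 3)     # patches -> (900, 9)
W = filters.reshape(8, -1)                     # filters -> (8, 9)
out = cols @ W.T                               # the convolution is now one GEMM
out = out.T.reshape(8, out_h, out_w)           # back to 8 feature maps
```

The duplicated pixels in `cols` are the “extra memory”; the single large GEMM is what the hardware wanted all along.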
2.4 Attention → Matrix Multiply
Transformers look exotic, but attention boils down to:
1. Q × Kᵀ
2. Softmax
3. Result × V
Steps (1) and (3) are matrix multiplications.
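A NumPy sketch of scaled dot-product attention makes the split explicit. The sequence length, head dimension, and the 1/√d scaling are standard choices assumed here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d = 128, 64                           # illustrative sizes
Q = np.random.rand(seq_len, d)
K = np.random.rand(seq_len, d)
V = np.random.rand(seq_len, d)

scores = Q @ K.T / np.sqrt(d)   # step 1: GEMM, (seq_len x seq_len)
weights = softmax(scores)       # step 2: elementwise math + a row reduction
out = weights @ V               # step 3: GEMM, (seq_len x d)
```

Almost all of the FLOPs sit in steps 1 and 3; the softmax in between is cheap but memory-bound, which is exactly why fused attention kernels exist.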
This is why:
- Attention scales well on GPUs
- TPUs excel at large sequence lengths
- Optimizing attention = optimizing GEMM
2.5 MLPs Were Always Matrix Multiply
Fully connected layers are literally:
y = Wx + b
No transformation required.
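A NumPy sketch with arbitrary batch and layer sizes shows that the layer is exactly one GEMM plus a broadcast add.

```python
import numpy as np

# A fully connected layer: the batched form of y = Wx + b, with inputs as rows.
batch, d_in, d_out = 32, 512, 256              # illustrative sizes
x = np.random.rand(batch, d_in)
W = np.random.rand(d_out, d_in)
b = np.random.rand(d_out)

y = x @ W.T + b                                # one GEMM, one broadcast add
```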
This is why:
- MLPs scale effortlessly
- They dominate compute cost
- They map cleanly to hardware
2.6 Why Accelerators Are “GEMM Machines”
Because:
- GEMM maximizes arithmetic intensity
- GEMM minimizes off-chip traffic
- GEMM maps cleanly to systolic arrays and SIMD
So accelerators are designed around:
- MAC arrays
- Tensor cores
- Systolic grids
Everything else is glue logic.
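A toy sketch of how a large GEMM decomposes into fixed-size tiles, which is roughly the granularity at which a MAC array or systolic grid consumes it. The tile size here is arbitrary and not tied to any real chip.

```python
import numpy as np

TILE = 16   # stands in for the MAC-array dimension; real accelerators pick this to match hardware

def tiled_matmul(A, B):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and N % TILE == 0 and K % TILE == 0
    C = np.zeros((M, N))
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            for k in range(0, K, TILE):
                # One TILE x TILE x TILE block of work: this is what the array
                # executes while the three tiles sit in local SRAM.
                C[i:i+TILE, j:j+TILE] += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]
    return C

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

Everything an accelerator does beyond this inner block is about keeping the tiles fed: the “glue logic” above.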
Chapter 2 Takeaway
Deep learning does not run on layers or graphs.
It runs on matrix multiplication.
Once you accept this, accelerator design becomes obvious.
For Readers Who Want to Go Deeper
Conceptual
- Goodfellow, Bengio, and Courville — Deep Learning
- Sze et al. — Efficient Processing of Deep Neural Networks: A Tutorial and Survey
Architecture-Level
- Jouppi et al. — In-Datacenter Performance Analysis of a Tensor Processing Unit (the TPU paper)
- NVIDIA Tensor Core documentation
Implementation-Level
- Chen et al. — Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks
- Gemmini (RISC-V DNN accelerator generator)
Interlude
A Concrete Roofline Example (With Numbers)
Let’s make the Roofline real.
Example Hardware
- Peak compute: 100 TFLOP/s
- Memory bandwidth: 1 TB/s
This means:
- To fully use the compute, you need at least 100 FLOPs per byte moved
Workload A: Vector Add
- ~1 FLOP per byte
- Performance limited to: 1 TB/s × 1 FLOP/byte = 1 TFLOP/s
➡ Uses 1% of peak compute
➡ Completely memory-bound
Workload B: Matrix Multiply
- ~200 FLOPs per byte
- Performance limited to: 100 TFLOP/s (the compute roof)
➡ Fully utilizes the hardware
➡ Compute-bound
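The same numbers as a few lines of Python. All values are the chapter’s illustrative figures, not measurements.

```python
# Roofline arithmetic for the example hardware above.
peak_compute = 100e12        # 100 TFLOP/s
bandwidth    = 1e12          # 1 TB/s
balance      = peak_compute / bandwidth   # 100 FLOP/byte needed to reach the roof

def attainable(intensity_flop_per_byte):
    # Roofline model: min(compute roof, bandwidth * arithmetic intensity)
    return min(peak_compute, bandwidth * intensity_flop_per_byte)

for name, ai in [("vector add", 1.0), ("matrix multiply", 200.0)]:
    perf = attainable(ai)
    print(f"{name:16s}: {perf / 1e12:6.1f} TFLOP/s "
          f"({100 * perf / peak_compute:.0f}% of peak)")
```

Running this prints 1 TFLOP/s (1% of peak) for the vector add and 100 TFLOP/s (100% of peak) for the matrix multiply.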
Moral
Hardware speedups only matter if your workload moves to the right on the Roofline.
Canonical Roofline Figure (Use This Once Everywhere)
Figure Requirements
- X-axis: Arithmetic Intensity (FLOP/byte, log scale)
- Y-axis: Attainable Performance (FLOP/s, log scale)
- One slanted bandwidth line
- One flat compute roof
- Two dots:
  - “Vector Add” (left)
  - “Matrix Multiply” (right)
This single figure explains:
- GPUs
- TPUs
- Scaling limits
- Why optimizations work
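One way to draw this figure, assuming matplotlib and reusing the Interlude’s illustrative numbers:

```python
import numpy as np
import matplotlib.pyplot as plt

peak_compute = 100e12                               # 100 TFLOP/s (compute roof)
bandwidth = 1e12                                    # 1 TB/s (bandwidth slope)

ai = np.logspace(-1, 3, 200)                        # arithmetic intensity axis
roofline = np.minimum(peak_compute, bandwidth * ai)

plt.figure()
plt.loglog(ai, roofline, label="Roofline")
plt.scatter([1], [min(peak_compute, bandwidth * 1)], label="Vector Add")
plt.scatter([200], [min(peak_compute, bandwidth * 200)], label="Matrix Multiply")
plt.xlabel("Arithmetic Intensity (FLOP/byte)")
plt.ylabel("Attainable Performance (FLOP/s)")
plt.title("Roofline: bandwidth slope, compute roof")
plt.legend()
plt.show()
```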