Chapter 2
Why Everything in Deep Learning Becomes Matrix Multiplication
The Promise of This Chapter
After this chapter, readers will understand:
- Why CNNs, Transformers, and MLPs all look different — but run the same
- Why accelerators revolve around GEMM
- Why “matrix multiply engines” dominate AI chips
This is where software abstractions collapse into hardware reality.
2.1 The Illusion of Diversity in Neural Networks
From the outside, models look very different:
- Convolutions for images
- Attention for language
- MLPs everywhere
But hardware doesn’t see “layers” or “tokens.”
Hardware sees:
Regular, dense numerical operations over large arrays
Once performance matters, all roads lead to matrix multiplication.
2.2 Why Matrix Multiply Is the Perfect Hardware Workload
Matrix multiplication has three properties hardware loves:
- Massive data reuse
  - Each value is used many times
- Regular access patterns
  - Predictable, schedulable
- High arithmetic intensity
  - Lots of math per byte moved
On the Roofline graph, GEMM lives far to the right.
This is not a software choice.
It is a physical inevitability.
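To put rough numbers on that claim, here is a back-of-the-envelope sketch. The sizes are my own illustrative choices, assuming fp32 operands and counting only the compulsory off-chip traffic (each matrix read or written once); it compares the arithmetic intensity of a large GEMM with that of a vector add.

```python
import numpy as np

# Rough arithmetic intensity (FLOPs per byte moved), fp32, compulsory traffic only.
def gemm_intensity(M, N, K, bytes_per_elem=4):
    flops = 2 * M * N * K                                    # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)   # read A and B, write C
    return flops / bytes_moved

def vector_add_intensity(n, bytes_per_elem=4):
    flops = n                                                # one add per element
    bytes_moved = bytes_per_elem * 3 * n                     # read a and b, write c
    return flops / bytes_moved

print(f"GEMM 4096^3 : {gemm_intensity(4096, 4096, 4096):.1f} FLOP/byte")
print(f"Vector add  : {vector_add_intensity(1_000_000):.3f} FLOP/byte")
```

The GEMM lands at hundreds of FLOPs per byte; the vector add at well under one. That gap is the whole story of this chapter.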
2.3 Convolutions → Matrix Multiply
A convolution:
- Slides a small filter over an image
- Accumulates dot products
By rearranging data (e.g., im2col):
- Each convolution becomes a matrix multiply
- Filters become one matrix
- Image patches become another
This costs some extra memory — but buys:
- Better reuse
- Higher performance
- Accelerator compatibility
Hardware prefers wasted memory to wasted bandwidth.
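A minimal im2col sketch in NumPy makes the rearrangement concrete. The shapes and names are illustrative (single image, single channel, stride 1, no padding), not taken from any particular framework.

```python
import numpy as np

# im2col: unfold every filter-sized patch of the image into one row of a matrix.
def im2col(image, kh, kw):
    H, W = image.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = image[i:i + kh, j:j + kw].ravel()
    return cols, (out_h, out_w)

image = np.random.rand(32, 32)
filters = np.random.rand(8, 3, 3)              # 8 filters of size 3x3

cols, (out_h, out_w) = im2col(image, 3, 3)     # patches -> (900, 9)
W = filters.reshape(8, -1)                     # filters -> (8, 9)
out = cols @ W.T                               # the convolution is now one GEMM
out = out.T.reshape(8, out_h, out_w)           # back to 8 feature maps
```

The duplicated pixels in `cols` are the “extra memory”; the single large GEMM is what the hardware wanted all along.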
2.4 Attention → Matrix Multiply
Transformers look exotic, but attention boils down to:
1. Q × Kᵀ
2. Softmax
3. Result × V
Steps (1) and (3) are matrix multiplications.
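A NumPy sketch of scaled dot-product attention makes the split explicit. The sequence length, head dimension, and the 1/√d scaling are standard choices assumed here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d = 128, 64                           # illustrative sizes
Q = np.random.rand(seq_len, d)
K = np.random.rand(seq_len, d)
V = np.random.rand(seq_len, d)

scores = Q @ K.T / np.sqrt(d)   # step 1: GEMM, (seq_len x seq_len)
weights = softmax(scores)       # step 2: elementwise math + a row reduction
out = weights @ V               # step 3: GEMM, (seq_len x d)
```

Almost all of the FLOPs sit in steps 1 and 3; the softmax in between is cheap but memory-bound, which is exactly why fused attention kernels exist.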
This is why:
- Attention scales well on GPUs
- TPUs excel at large sequence lengths
- Optimizing attention = optimizing GEMM
2.5 MLPs Were Always Matrix Multiply
Fully connected layers are literally:
y = Wx + b
No transformation required.
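A NumPy sketch with arbitrary batch and layer sizes shows that the layer is exactly one GEMM plus a broadcast add.

```python
import numpy as np

# A fully connected layer: the batched form of y = Wx + b, with inputs as rows.
batch, d_in, d_out = 32, 512, 256              # illustrative sizes
x = np.random.rand(batch, d_in)
W = np.random.rand(d_out, d_in)
b = np.random.rand(d_out)

y = x @ W.T + b                                # one GEMM, one broadcast add
```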
This is why:
- MLPs scale effortlessly
- They dominate compute cost
- They map cleanly to hardware
2.6 Why Accelerators Are “GEMM Machines”
Because:
- GEMM maximizes arithmetic intensity
- GEMM minimizes off-chip traffic
- GEMM maps cleanly to systolic arrays and SIMD
So accelerators are designed around:
- MAC arrays
- Tensor cores
- Systolic grids
Everything else is glue logic.
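A toy sketch of how a large GEMM decomposes into fixed-size tiles, which is roughly the granularity at which a MAC array or systolic grid consumes it. The tile size here is arbitrary and not tied to any real chip.

```python
import numpy as np

TILE = 16   # stands in for the MAC-array dimension; real accelerators pick this to match hardware

def tiled_matmul(A, B):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and N % TILE == 0 and K % TILE == 0
    C = np.zeros((M, N))
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            for k in range(0, K, TILE):
                # One TILE x TILE x TILE block of work: this is what the array
                # executes while the three tiles sit in local SRAM.
                C[i:i+TILE, j:j+TILE] += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]
    return C

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

Everything an accelerator does beyond this inner block is about keeping the tiles fed: the “glue logic” above.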
Chapter 2 Takeaway
Deep learning does not run on layers or graphs.
It runs on matrix multiplication.
Once you accept this, accelerator design becomes obvious.
For Readers Who Want to Go Deeper
Conceptual
- Goodfellow, Bengio, and Courville — Deep Learning
- Sze et al. — Efficient Processing of Deep Neural Networks: A Tutorial and Survey
Architecture-Level
- Jouppi et al. — In-Datacenter Performance Analysis of a Tensor Processing Unit (the TPU paper)
- NVIDIA Tensor Core documentation
Implementation-Level
- Chen et al. — Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks
- Gemmini (RISC-V DNN accelerator generator)
Interlude
A Concrete Roofline Example (With Numbers)
Let’s make the Roofline real.
Example Hardware
- Peak compute: 100 TFLOP/s
- Memory bandwidth: 1 TB/s
This means:
- To fully use the compute, you need at least 100 FLOPs per byte moved
Workload A: Vector Add
- ~1 FLOP per byte
- Performance limited to: 1 TB/s × 1 FLOP/byte = 1 TFLOP/s
➡ Uses 1% of peak compute
➡ Completely memory-bound
Workload B: Matrix Multiply
- ~200 FLOPs per byte
- Performance limited to: 100 TFLOP/s (the compute roof)
➡ Fully utilizes the hardware
➡ Compute-bound
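The same numbers as a few lines of Python. All values are the chapter’s illustrative figures, not measurements.

```python
# Roofline arithmetic for the example hardware above.
peak_compute = 100e12        # 100 TFLOP/s
bandwidth    = 1e12          # 1 TB/s
balance      = peak_compute / bandwidth   # 100 FLOP/byte needed to reach the roof

def attainable(intensity_flop_per_byte):
    # Roofline model: min(compute roof, bandwidth * arithmetic intensity)
    return min(peak_compute, bandwidth * intensity_flop_per_byte)

for name, ai in [("vector add", 1.0), ("matrix multiply", 200.0)]:
    perf = attainable(ai)
    print(f"{name:16s}: {perf / 1e12:6.1f} TFLOP/s "
          f"({100 * perf / peak_compute:.0f}% of peak)")
```

Running this prints 1 TFLOP/s (1% of peak) for the vector add and 100 TFLOP/s (100% of peak) for the matrix multiply.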
Moral
Hardware speedups only matter if your workload moves to the right on the Roofline.
Canonical Roofline Figure (Use This Once Everywhere)
Figure Requirements
- X-axis: Arithmetic Intensity (FLOP/byte, log scale)
- Y-axis: Attainable Performance (FLOP/s, log scale)
- One slanted bandwidth line
- One flat compute roof
- Two dots:
  - “Vector Add” (left)
  - “Matrix Multiply” (right)
This single figure explains:
- GPUs
- TPUs
- Scaling limits
- Why optimizations work
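One way to draw this figure, assuming matplotlib and reusing the Interlude’s illustrative numbers:

```python
import numpy as np
import matplotlib.pyplot as plt

peak_compute = 100e12                               # 100 TFLOP/s (compute roof)
bandwidth = 1e12                                    # 1 TB/s (bandwidth slope)

ai = np.logspace(-1, 3, 200)                        # arithmetic intensity axis
roofline = np.minimum(peak_compute, bandwidth * ai)

plt.figure()
plt.loglog(ai, roofline, label="Roofline")
plt.scatter([1], [min(peak_compute, bandwidth * 1)], label="Vector Add")
plt.scatter([200], [min(peak_compute, bandwidth * 200)], label="Matrix Multiply")
plt.xlabel("Arithmetic Intensity (FLOP/byte)")
plt.ylabel("Attainable Performance (FLOP/s)")
plt.title("Roofline: bandwidth slope, compute roof")
plt.legend()
plt.show()
```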