Chapter 2

Why Everything in Deep Learning Becomes Matrix Multiplication

The Promise of This Chapter

After this chapter, readers will understand:

  • Why CNNs, Transformers, and MLPs all look different — but run the same

  • Why accelerators revolve around GEMM

  • Why “matrix multiply engines” dominate AI chips

This is where software abstractions collapse into hardware reality.


2.1 The Illusion of Diversity in Neural Networks

From the outside, models look very different:

  • Convolutions for images

  • Attention for language

  • MLPs everywhere

But hardware doesn’t see “layers” or “tokens.”

Hardware sees:

Regular, dense numerical operations over large arrays

Once performance matters, all roads lead to matrix multiplication.


2.2 Why Matrix Multiply Is the Perfect Hardware Workload

Matrix multiplication has three properties hardware loves:

  1. Massive data reuse

    • Each value is used many times

  2. Regular access patterns

    • Predictable, schedulable

  3. High arithmetic intensity

    • Lots of math per byte moved

On the Roofline graph, GEMM lives far to the right.

This is not a software choice.
It is a physical inevitability.
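
To make "high arithmetic intensity" concrete, here is a back-of-the-envelope sketch in Python (the language and the 4-byte elements are my assumptions; the chapter doesn't prescribe either). It counts FLOPs and bytes for a square matrix multiply versus a vector add, assuming each operand is moved once:

def gemm_intensity(n, bytes_per_elem=4):
    """Arithmetic intensity of C = A @ B for n x n matrices, assuming each matrix is moved exactly once."""
    flops = 2 * n**3                         # one multiply + one add per (i, j, k)
    bytes_moved = 3 * n**2 * bytes_per_elem  # read A and B, write C
    return flops / bytes_moved

def vector_add_intensity(n, bytes_per_elem=4):
    """Arithmetic intensity of c = a + b for length-n vectors."""
    flops = n                                # one add per element
    bytes_moved = 3 * n * bytes_per_elem     # read a and b, write c
    return flops / bytes_moved

print(gemm_intensity(4096))        # ~683 FLOP/byte: grows with n, far right on the Roofline
print(vector_add_intensity(4096))  # ~0.08 FLOP/byte: fixed, stuck against the bandwidth slope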


2.3 Convolutions → Matrix Multiply

A convolution:

  • Slides a small filter over an image

  • Accumulates dot products

By rearranging data (e.g., im2col):

  • Each convolution becomes a matrix multiply

  • Filters become one matrix

  • Image patches become another

This costs some extra memory — but buys:

  • Better reuse

  • Higher performance

  • Accelerator compatibility

Hardware prefers wasted memory to wasted bandwidth.
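
As an illustration of the im2col idea, here is a minimal NumPy sketch for a single-channel image with stride 1 and no padding (the helper name im2col and the toy shapes are assumptions for illustration, not a reference implementation):

import numpy as np

def im2col(x, kh, kw):
    """Unfold every kh x kw patch of a 2-D image into one row of a matrix (stride 1, no padding)."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.random.rand(8, 8)            # toy single-channel image
filters = np.random.rand(4, 3 * 3)  # 4 filters of size 3x3, each flattened into a row
patches = im2col(x, 3, 3)           # (36, 9): one row per output position
y = patches @ filters.T             # (36, 4): the whole convolution is now one GEMM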


2.4 Attention → Matrix Multiply

Transformers look exotic, but attention boils down to:

  1. Q × Kᵀ

  2. Softmax

  3. Result × V

Steps (1) and (3) are matrix multiplications.

This is why:

  • Attention scales well on GPUs

  • TPUs excel at large sequence lengths

  • Optimizing attention = optimizing GEMM
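
A minimal NumPy sketch of that three-step recipe (the 1/√d scaling is the standard scaled dot-product form; the shapes and names are illustrative assumptions):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # step 1: GEMM producing a (seq, seq) score matrix
    weights = softmax(scores)       # step 2: elementwise, cheap next to the GEMMs
    return weights @ V              # step 3: GEMM, weighted sum of values

seq, d = 128, 64
Q, K, V = (np.random.rand(seq, d) for _ in range(3))
out = attention(Q, K, V)            # (128, 64)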


2.5 MLPs Were Always Matrix Multiply

Fully connected layers are literally:

y = Wx + b

No transformation required.

This is why:

  • MLPs scale effortlessly

  • They dominate compute cost

  • They map cleanly to hardware
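
Over a whole batch, the same equation is a single GEMM plus a broadcast add. A NumPy sketch with arbitrary example shapes:

import numpy as np

batch, d_in, d_out = 32, 512, 256
X = np.random.rand(batch, d_in)   # activations, one row per example
W = np.random.rand(d_out, d_in)   # weight matrix
b = np.random.rand(d_out)         # bias

Y = X @ W.T + b                   # (32, 256): y = Wx + b for every example in one GEMM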


2.6 Why Accelerators Are “GEMM Machines”

Because:

  • GEMM maximizes arithmetic intensity

  • GEMM minimizes off-chip traffic

  • GEMM maps cleanly to systolic arrays and SIMD

So accelerators are designed around:

  • MAC arrays

  • Tensor cores

  • Systolic grids

Everything else is glue logic.
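
The loop nest below is a deliberately naive Python sketch of the multiply-accumulate pattern that MAC arrays and systolic grids lay out in silicon. It shows the structure those engines replicate; it is not how you would write a fast GEMM in software:

import numpy as np

def gemm_macs(A, B):
    """C = A @ B written as explicit multiply-accumulates, the primitive accelerators replicate."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]   # one MAC; hardware runs thousands of these per cycle
    return C

A, B = np.random.rand(16, 8), np.random.rand(8, 12)
assert np.allclose(gemm_macs(A, B), A @ B)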


Chapter 2 Takeaway

Deep learning does not run on layers or graphs.
It runs on matrix multiplication.

Once you accept this, accelerator design becomes obvious.


For Readers Who Want to Go Deeper

🟢 Conceptual

  • Goodfellow et al. — Deep Learning

  • Sze et al. — Efficient Processing of DNNs

🟡 Architecture-Level

  • Jouppi et al. — TPU paper

  • NVIDIA Tensor Core documentation

🔴 Implementation-Level

  • Eyeriss paper

  • Gemmini (RISC-V accelerator)


Interlude

A Concrete Roofline Example (With Numbers)

Let’s make the Roofline real.

Example Hardware

  • Peak compute: 100 TFLOP/s

  • Memory bandwidth: 1 TB/s

This means:

  • To keep the compute units fully busy, you need at least
    100 FLOPs per byte moved


Workload A: Vector Add

  • ~1 FLOP per byte

  • Performance limited to:
    1 TB/s × 1 FLOP/B = 1 TFLOP/s

➡ Uses 1% of peak compute
➡ Completely memory-bound


Workload B: Matrix Multiply

  • ~200 FLOPs per byte

  • Performance limited to:
    100 TFLOP/s (compute roof)

➡ Fully utilizes hardware
➡ Compute-bound
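
Both workloads fall out of one formula: attainable performance = min(compute roof, bandwidth × arithmetic intensity). A tiny Python sketch with this interlude's assumed numbers:

PEAK_TFLOPS = 100.0     # compute roof, TFLOP/s
BANDWIDTH_TBS = 1.0     # memory bandwidth, TB/s (TB/s × FLOP/byte gives TFLOP/s)

def attainable_tflops(flops_per_byte):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(PEAK_TFLOPS, BANDWIDTH_TBS * flops_per_byte)

print(attainable_tflops(1))     # vector add:      1.0 TFLOP/s, memory-bound (~1% of peak)
print(attainable_tflops(200))   # matrix multiply: 100.0 TFLOP/s, pinned at the compute roof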


Moral

Hardware speedups only matter if your workload moves right on the Roofline.


Canonical Roofline Figure (Use This Once Everywhere)

Figure Requirements

  • X-axis: Arithmetic Intensity (log scale)

  • Y-axis: Performance

  • One slanted bandwidth line

  • One flat compute roof

  • Two dots:

    • “Vector Add” (left)

    • “Matrix Multiply” (right)

This single figure explains:

  • GPUs

  • TPUs

  • Scaling limits

  • Why optimizations work
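
If you would rather generate the figure than draw it by hand, here is a minimal Matplotlib sketch (Matplotlib is my choice, and the numbers are the interlude's example hardware):

import numpy as np
import matplotlib.pyplot as plt

PEAK, BW = 100.0, 1.0                      # TFLOP/s, TB/s
ai = np.logspace(-1, 3, 200)               # arithmetic intensity, FLOP/byte (log-spaced)
roof = np.minimum(PEAK, BW * ai)           # attainable performance at each intensity

plt.loglog(ai, roof, label="Roofline")
plt.scatter([1, 200], [min(PEAK, BW * 1), min(PEAK, BW * 200)], color="black")
plt.annotate("Vector Add", (1, 1))
plt.annotate("Matrix Multiply", (200, 100))
plt.xlabel("Arithmetic Intensity (FLOP/byte)")
plt.ylabel("Performance (TFLOP/s)")
plt.legend()
plt.show()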

