Chapter 1
The Roofline Model: The One Graph That Explains AI Hardware
The Promise of This Chapter
If you understand this chapter, you will:
See at a glance why hardware underperforms its spec sheet
Understand why peak FLOPS alone are meaningless
Predict when GPUs or TPUs will help — and when they won’t
Everything else in this primer builds on this.
1.1 Why Peak Performance Numbers Lie
Hardware specs love big numbers:
“100 TFLOPS”
“1.5 TB/s bandwidth”
“Thousands of cores”
But real workloads almost never reach peak performance.
Why?
Because performance is limited by two things, not one:
How fast you can compute
How fast you can move data
The Roofline model puts both on a single graph.
1.2 The Two Limits That Matter
Limit 1: Compute Throughput
This is the best-case scenario:
Data is already available
Compute units are fully utilized
Nothing is waiting
This is the flat roof of the roofline.
Limit 2: Memory Bandwidth
This is the common case:
Compute units wait for data
Memory can’t feed them fast enough
This is the slanted roof.
Whichever limit you hit first determines performance.
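That sentence is already a cost model. A minimal sketch in Python, assuming compute and memory traffic overlap perfectly, and reusing the illustrative spec-sheet numbers from above (they describe no particular chip):

```python
def runtime_seconds(flops, bytes_moved,
                    peak_flops=100e12,   # illustrative: 100 TFLOPS peak
                    bandwidth=1.5e12):   # illustrative: 1.5 TB/s bandwidth
    """Each limit implies a minimum time; the slower one wins."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth
    return max(compute_time, memory_time)

# 1M-element vector add: 1e6 FLOPs, but 12 MB of traffic.
print(runtime_seconds(1e6, 12e6))  # 8e-06 s: memory time dominates by 800x
```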
1.3 Arithmetic Intensity: The Key Quantity
The Roofline model introduces one crucial idea:
Arithmetic Intensity = FLOPs performed / bytes moved (measured in FLOPs per byte)
In simple terms:
How many math operations do you perform per byte fetched?
Examples:
Low intensity: vector add (touch data once)
High intensity: matrix multiply (reuse data many times)
This single number determines whether you are:
Memory-bound (data-starved)
Compute-bound (limited by math throughput)
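Back-of-the-envelope numbers for those two examples (a sketch assuming fp32, i.e. 4 bytes per value, and that every array crosses the memory bus exactly once):

```python
def vector_add_intensity(n):
    flops = n                   # one add per element
    bytes_moved = 3 * n * 4     # read a, read b, write c (fp32)
    return flops / bytes_moved  # ~0.083 FLOPs/byte, no matter how big n is

def matmul_intensity(n):
    flops = 2 * n**3            # n^3 multiply-adds, counted as 2 FLOPs each
    bytes_moved = 3 * n**2 * 4  # read A, read B, write C once each
    return flops / bytes_moved  # n/6 FLOPs/byte: grows with problem size

print(vector_add_intensity(1_000_000))  # 0.0833...
print(matmul_intensity(4096))           # ~683
```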
1.4 The Roofline Graph (Conceptually)
Think of the graph like this:
X-axis: Arithmetic intensity in FLOPs per byte (log scale)
Y-axis: Attainable performance in FLOP/s (log scale)
There are two regions:
Slanted line → memory-bound
Flat line → compute-bound
Your workload moves right as you improve data reuse.
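The whole graph is one formula: attainable performance = min(flat roof, slanted roof). A plotting sketch with the same illustrative 100 TFLOPS / 1.5 TB/s numbers; the ridge point, where the two roofs meet, sits at peak ÷ bandwidth ≈ 67 FLOPs/byte here:

```python
import numpy as np
import matplotlib.pyplot as plt

peak_tflops, bandwidth_tbps = 100.0, 1.5          # illustrative chip

intensity = np.logspace(-2, 4, 200)               # FLOPs/byte
attainable = np.minimum(peak_tflops, bandwidth_tbps * intensity)

plt.loglog(intensity, attainable)                 # both axes log scale
plt.axvline(peak_tflops / bandwidth_tbps, linestyle="--")  # ridge point
plt.xlabel("Arithmetic intensity (FLOPs/byte)")
plt.ylabel("Attainable TFLOPS")
plt.show()
```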
1.5 Why Matrix Multiply Is Special
Matrix multiplication sits far to the right:
Each input is reused many times
Data movement is amortized
Arithmetic intensity is high
This is why:
GPUs shine on GEMM (general matrix multiplication)
TPUs are built around GEMM
Deep learning maps everything to GEMM
The Roofline doesn’t prefer matrix multiply — physics does.
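Counting the traffic two ways shows where the amortization comes from (fp32, illustrative accounting):

```python
N = 4096
flops = 2 * N**3                  # total multiply-adds, 2 FLOPs each

# With full reuse: each matrix crosses the memory bus exactly once.
ideal_bytes = 3 * N**2 * 4

# With no reuse: refetch a row of A and a column of B for every output.
naive_bytes = (2 * N**3 + N**2) * 4

print(flops / ideal_bytes)   # ~683 FLOPs/byte: far right on the roofline
print(flops / naive_bytes)   # ~0.25 FLOPs/byte: memory-bound without reuse
```

Same math, same FLOPs; only the reuse differs, and with it the position on the roofline.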
1.6 Why Most Code Is Memory-Bound
Many operations:
Touch data once
Do little math
Move on
Examples:
Elementwise ops
Reductions
Poorly tiled kernels
These sit on the left side of the roofline.
No amount of extra compute fixes this.
Only more reuse does.
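The standard remedy is fusion: do all the math in a single pass so each byte fetched feeds several operations. A back-of-the-envelope sketch for y = sqrt(x * 2 + 1) over n fp32 values (counting sqrt as one FLOP):

```python
n = 10_000_000
flops = 3 * n                     # multiply, add, sqrt per element

unfused_bytes = 3 * (2 * n * 4)   # three passes, each reads and writes fp32
fused_bytes = 2 * n * 4           # one pass: read input, write output

print(flops / unfused_bytes)  # 0.125 FLOPs/byte
print(flops / fused_bytes)    # 0.375 FLOPs/byte: 3x the reuse per byte
```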
1.7 How Hardware Designers Use the Roofline
Hardware architects ask:
Where will real workloads land on the intensity axis?
How many of them sit left of the ridge, in memory-bound territory?
How expensive is each byte of data movement?
This leads directly to:
Large on-chip SRAMs
High-bandwidth memory (HBM)
Systolic arrays
Massive threading
The roofline explains why these features exist.
1.8 GPUs Through the Roofline Lens
GPUs assume:
Many workloads are memory-bound
Latency is unavoidable
So they:
Run thousands of threads
Switch work while waiting for data
Use caches and shared memory
GPUs hide memory latency.
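Little's law makes "thousands of threads" quantitative: keeping memory busy requires bandwidth × latency bytes in flight at all times. With illustrative numbers (the 1.5 TB/s from above and an assumed ~500 ns DRAM round trip):

```python
bandwidth = 1.5e12                       # bytes/s
latency = 500e-9                         # s, assumed DRAM round-trip time
bytes_in_flight = bandwidth * latency    # 750,000 bytes pending at once

request_size = 128                       # bytes per outstanding memory access
print(bytes_in_flight / request_size)    # ~5,860 concurrent requests needed
```

Thousands of outstanding requests means thousands of threads ready to park on a pending load while others run.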
1.9 TPUs Through the Roofline Lens
TPUs assume:
Data movement must be minimized
Workloads are predictable
So they:
Use fixed dataflow
Keep data stationary
Maximize on-chip reuse
TPUs avoid memory latency.
Same roofline. Different strategy.
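A toy accounting of "keep data stationary" (illustrative numbers, not a model of any real TPU): for y = x @ W, loading the weights once and streaming the batch past them slashes memory traffic.

```python
B, K, N = 1024, 4096, 4096   # batch, input dim, output dim (fp32)

# Weight-stationary: W is fetched once and reused across all B rows.
stationary_bytes = (K * N + B * K + B * N) * 4

# No reuse: refetch a weight for every multiply it participates in.
no_reuse_bytes = (B * K * N + B * K + B * N) * 4

print(no_reuse_bytes / stationary_bytes)  # ~683x less traffic with reuse
```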
1.10 The Most Important Insight
Here is the insight to underline:
Performance usually improves more by increasing arithmetic intensity than by raising peak FLOPS.
This is why:
Algorithm design matters
Compiler scheduling matters
Memory layout matters more than ALU count
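Concretely, on the same illustrative roofline as before:

```python
def attainable(intensity, peak=100.0, bw=1.5):  # TFLOPS, TB/s
    return min(peak, bw * intensity)

print(attainable(1.0))              # 1.5 TFLOPS: memory-bound
print(attainable(1.0, peak=200.0))  # 1.5 TFLOPS: doubling peak did nothing
print(attainable(2.0))              # 3.0 TFLOPS: doubling reuse doubled speed
```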
Chapter 1 Takeaway
If you remember one thing:
The Roofline model tells you why performance stops scaling — and what kind of hardware can fix it.
Before asking:
“Is this GPU fast enough?”
Ask:
“Where does my workload sit on the roofline?”
For Readers Who Want to Go Deeper 🔍
🟢 Conceptual
Hennessy & Patterson — Computer Architecture: A Quantitative Approach (Roofline model)
Williams et al. — Roofline: An Insightful Visual Performance Model for Multicore Architectures
🟡 Architecture-Level
NVIDIA CUDA Programming Guide (memory hierarchy)
Jouppi et al. — In-Datacenter Performance Analysis of a Tensor Processing Unit
🔴 Hardware / Circuit-Level
Rabaey — Digital Integrated Circuits
Chandrakasan — Low Power CMOS Design