Chapter 1
The Roofline Model: The One Graph That Explains AI Hardware
The Promise of This Chapter
If you understand this chapter, you will:
See at a glance why hardware underperforms its spec sheet
Understand why peak FLOPS alone are meaningless
Predict when GPUs or TPUs will help — and when they won’t
Everything else in this primer builds on this.
1.1 Why Peak Performance Numbers Lie
Hardware specs love big numbers:
“100 TFLOPS”
“1.5 TB/s bandwidth”
“Thousands of cores”
But real workloads almost never reach peak performance.
Why?
Because performance is limited by two things, not one:
How fast you can compute
How fast you can move data
The Roofline model puts both on a single graph.
1.2 The Two Limits That Matter
Limit 1: Compute Throughput
This is the best-case scenario:
Data is already available
Compute units are fully utilized
Nothing is waiting
This is the flat roof of the roofline.
Limit 2: Memory Bandwidth
This is the common case:
Compute units wait for data
Memory can’t feed them fast enough
This is the slanted roof.
Whichever limit you hit first determines performance.
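That sentence is already a cost model. A minimal sketch in Python, assuming compute and memory traffic overlap perfectly, and reusing the illustrative spec-sheet numbers from above (they describe no particular chip):

```python
def runtime_seconds(flops, bytes_moved,
                    peak_flops=100e12,   # illustrative: 100 TFLOPS peak
                    bandwidth=1.5e12):   # illustrative: 1.5 TB/s bandwidth
    """Each limit implies a minimum time; the slower one wins."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth
    return max(compute_time, memory_time)

# 1M-element vector add: 1e6 FLOPs, but 12 MB of traffic.
print(runtime_seconds(1e6, 12e6))  # 8e-06 s: memory time dominates by 800x
```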
1.3 Arithmetic Intensity: The Key Quantity
The Roofline model introduces one crucial idea:
Arithmetic Intensity = FLOPs performed / bytes moved (measured in FLOPs per byte)
In simple terms:
How many math operations do you perform per byte fetched?
Examples:
Low intensity: vector add (touch data once)
High intensity: matrix multiply (reuse data many times)
This single number determines whether you are:
Memory-bound (data-starved)
Compute-bound (limited by math throughput)
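Back-of-the-envelope numbers for those two examples (a sketch assuming fp32, i.e. 4 bytes per value, and that every array crosses the memory bus exactly once):

```python
def vector_add_intensity(n):
    flops = n                   # one add per element
    bytes_moved = 3 * n * 4     # read a, read b, write c (fp32)
    return flops / bytes_moved  # ~0.083 FLOPs/byte, no matter how big n is

def matmul_intensity(n):
    flops = 2 * n**3            # n^3 multiply-adds, counted as 2 FLOPs each
    bytes_moved = 3 * n**2 * 4  # read A, read B, write C once each
    return flops / bytes_moved  # n/6 FLOPs/byte: grows with problem size

print(vector_add_intensity(1_000_000))  # 0.0833...
print(matmul_intensity(4096))           # ~683
```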
1.4 The Roofline Graph (Conceptually)
Think of the graph like this:
X-axis: Arithmetic intensity in FLOPs per byte (log scale)
Y-axis: Attainable performance in FLOP/s (log scale)
There are two regions:
Slanted line → memory-bound
Flat line → compute-bound
Your workload moves right as you improve data reuse.
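The whole graph is one formula: attainable performance = min(flat roof, slanted roof). A plotting sketch with the same illustrative 100 TFLOPS / 1.5 TB/s numbers; the ridge point, where the two roofs meet, sits at peak ÷ bandwidth ≈ 67 FLOPs/byte here:

```python
import numpy as np
import matplotlib.pyplot as plt

peak_tflops, bandwidth_tbps = 100.0, 1.5          # illustrative chip

intensity = np.logspace(-2, 4, 200)               # FLOPs/byte
attainable = np.minimum(peak_tflops, bandwidth_tbps * intensity)

plt.loglog(intensity, attainable)                 # both axes log scale
plt.axvline(peak_tflops / bandwidth_tbps, linestyle="--")  # ridge point
plt.xlabel("Arithmetic intensity (FLOPs/byte)")
plt.ylabel("Attainable TFLOPS")
plt.show()
```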
1.5 Why Matrix Multiply Is Special
Matrix multiplication sits far to the right:
Each input is reused many times
Data movement is amortized
Arithmetic intensity is high
This is why:
GPUs shine on GEMM (general matrix multiplication)
TPUs are built around GEMM
Deep learning maps everything to GEMM
The Roofline doesn’t prefer matrix multiply — physics does.
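Counting the traffic two ways shows where the amortization comes from (fp32, illustrative accounting):

```python
N = 4096
flops = 2 * N**3                  # total multiply-adds, 2 FLOPs each

# With full reuse: each matrix crosses the memory bus exactly once.
ideal_bytes = 3 * N**2 * 4

# With no reuse: refetch a row of A and a column of B for every output.
naive_bytes = (2 * N**3 + N**2) * 4

print(flops / ideal_bytes)   # ~683 FLOPs/byte: far right on the roofline
print(flops / naive_bytes)   # ~0.25 FLOPs/byte: memory-bound without reuse
```

Same math, same FLOPs; only the reuse differs, and with it the position on the roofline.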
1.6 Why Most Code Is Memory-Bound
Many operations:
Touch data once
Do little math
Move on
Examples:
Elementwise ops
Reductions
Poorly tiled kernels
These sit on the left side of the roofline.
No amount of extra compute fixes this.
Only more reuse does.
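The standard remedy is fusion: do all the math in a single pass so each byte fetched feeds several operations. A back-of-the-envelope sketch for y = sqrt(x * 2 + 1) over n fp32 values (counting sqrt as one FLOP):

```python
n = 10_000_000
flops = 3 * n                     # multiply, add, sqrt per element

unfused_bytes = 3 * (2 * n * 4)   # three passes, each reads and writes fp32
fused_bytes = 2 * n * 4           # one pass: read input, write output

print(flops / unfused_bytes)  # 0.125 FLOPs/byte
print(flops / fused_bytes)    # 0.375 FLOPs/byte: 3x the reuse per byte
```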
1.7 How Hardware Designers Use the Roofline
Hardware architects ask:
Where will real workloads land on the intensity axis?
How many of them sit left of the ridge, in memory-bound territory?
How expensive is each byte of data movement?
This leads directly to:
Large on-chip SRAMs
High-bandwidth memory (HBM)
Systolic arrays
Massive threading
The roofline explains why these features exist.
1.8 GPUs Through the Roofline Lens
GPUs assume:
Many workloads are memory-bound
Latency is unavoidable
So they:
Run thousands of threads
Switch work while waiting for data
Use caches and shared memory
GPUs hide memory latency.
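Little's law makes "thousands of threads" quantitative: keeping memory busy requires bandwidth × latency bytes in flight at all times. With illustrative numbers (the 1.5 TB/s from above and an assumed ~500 ns DRAM round trip):

```python
bandwidth = 1.5e12                       # bytes/s
latency = 500e-9                         # s, assumed DRAM round-trip time
bytes_in_flight = bandwidth * latency    # 750,000 bytes pending at once

request_size = 128                       # bytes per outstanding memory access
print(bytes_in_flight / request_size)    # ~5,860 concurrent requests needed
```

Thousands of outstanding requests means thousands of threads ready to park on a pending load while others run.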
1.9 TPUs Through the Roofline Lens
TPUs assume:
Data movement must be minimized
Workloads are predictable
So they:
Use fixed dataflow
Keep data stationary
Maximize on-chip reuse
TPUs avoid memory latency.
Same roofline. Different strategy.
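A toy accounting of "keep data stationary" (illustrative numbers, not a model of any real TPU): for y = x @ W, loading the weights once and streaming the batch past them slashes memory traffic.

```python
B, K, N = 1024, 4096, 4096   # batch, input dim, output dim (fp32)

# Weight-stationary: W is fetched once and reused across all B rows.
stationary_bytes = (K * N + B * K + B * N) * 4

# No reuse: refetch a weight for every multiply it participates in.
no_reuse_bytes = (B * K * N + B * K + B * N) * 4

print(no_reuse_bytes / stationary_bytes)  # ~683x less traffic with reuse
```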
1.10 The Most Important Insight
Here is the insight to underline:
Performance usually improves more by increasing arithmetic intensity than by raising peak FLOPS.
This is why:
Algorithm design matters
Compiler scheduling matters
Memory layout matters more than ALU count
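Concretely, on the same illustrative roofline as before:

```python
def attainable(intensity, peak=100.0, bw=1.5):  # TFLOPS, TB/s
    return min(peak, bw * intensity)

print(attainable(1.0))              # 1.5 TFLOPS: memory-bound
print(attainable(1.0, peak=200.0))  # 1.5 TFLOPS: doubling peak did nothing
print(attainable(2.0))              # 3.0 TFLOPS: doubling reuse doubled speed
```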
Chapter 1 Takeaway
If you remember one thing:
The Roofline model tells you why performance stops scaling — and what kind of hardware can fix it.
Before asking:
“Is this GPU fast enough?”
Ask:
“Where does my workload sit on the roofline?”
For Readers Who Want to Go Deeper 🔍
🟢 Conceptual
Hennessy & Patterson — Computer Architecture: A Quantitative Approach (Roofline model)
Williams et al. — Roofline: An Insightful Visual Performance Model for Multicore Architectures
🟡 Architecture-Level
NVIDIA CUDA Programming Guide (memory hierarchy)
Jouppi et al. — In-Datacenter Performance Analysis of a Tensor Processing Unit
🔴 Hardware / Circuit-Level
Rabaey — Digital Integrated Circuits
Chandrakasan — Low Power CMOS Design