Monday, December 22, 2025

Chapter 1 The Roofline Model: The One Graph That Explains AI Hardware

 


The Promise of This Chapter

If you understand this chapter, you will:

  • Instantly see why hardware underperforms

  • Understand why FLOPs alone are meaningless

  • Predict when GPUs or TPUs will help — and when they won’t

Everything else in this primer builds on this.


1.1 Why Peak Performance Numbers Lie

Hardware specs love big numbers:

  • “100 TFLOPs”

  • “1.5 TB/s bandwidth”

  • “Thousands of cores”

But real workloads almost never reach peak performance.

Why?

Because performance is limited by two things, not one:

  1. How fast you can compute

  2. How fast you can move data

The Roofline model puts both on a single graph.


1.2 The Two Limits That Matter

Limit 1: Compute Throughput

This is the best-case scenario:

  • Data is already available

  • Compute units are fully utilized

  • Nothing is waiting

This is the flat roof of the roofline.

Limit 2: Memory Bandwidth

This is the common case:

  • Compute units wait for data

  • Memory can’t feed them fast enough

This is the slanted roof.

Whichever limit you hit first determines performance.


1.3 Arithmetic Intensity: The Key Quantity

The Roofline model introduces one crucial idea:

Arithmetic Intensity = FLOPs performed / Bytes moved

In simple terms:

  • How many math operations do you perform per byte fetched?

Examples:

  • Low intensity: vector add (touch data once)

  • High intensity: matrix multiply (reuse data many times)

This single number determines whether you are:

  • Memory-bound (data-starved)

  • Compute-bound (compute-limited)
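To make the ratio concrete, here is a back-of-the-envelope sketch in Python. The byte counts are idealized (float32 operands, each value moved exactly once, no caches or tiling), so treat the numbers as intuition, not measurement:

```python
# Arithmetic intensity = FLOPs performed / bytes moved to and from memory.
# Idealized float32 (4-byte) counts; a sketch, not a profiler.

def vector_add_intensity(n: int) -> float:
    """c = a + b over n elements: n FLOPs, 3n float32 values moved."""
    flops = n                   # one add per element
    bytes_moved = 3 * n * 4     # read a, read b, write c
    return flops / bytes_moved

def matmul_intensity(n: int) -> float:
    """C = A @ B for n x n matrices: 2n^3 FLOPs, 3n^2 values moved (ideal reuse)."""
    flops = 2 * n**3            # n^2 outputs, each needing n multiply-adds
    bytes_moved = 3 * n**2 * 4  # read A, read B, write C once each
    return flops / bytes_moved

print(vector_add_intensity(1_000_000))  # ~0.083 FLOPs/byte, constant in n
print(matmul_intensity(1024))           # ~170 FLOPs/byte, and it grows with n
```

Note what the two functions show: vector add is stuck at a constant intensity no matter how big the vectors get, while matmul's intensity scales with matrix size. That asymmetry is the whole story of the next two sections.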


1.4 The Roofline Graph (Conceptually)

Think of the graph like this:

  • X-axis: Arithmetic intensity (FLOPs per byte), typically on a log scale

  • Y-axis: Attainable performance (FLOPs per second)

There are two regions:

  • Slanted line → memory-bound (performance = bandwidth × intensity)

  • Flat line → compute-bound (performance = peak compute)

Your workload moves right as you improve data reuse.
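The whole graph collapses into one formula: attainable performance is the minimum of the two roofs. A tiny sketch, using the round spec-sheet numbers from earlier (100 TFLOPs, 1.5 TB/s) purely for illustration:

```python
# Roofline: attainable performance is the lower of the two ceilings.
# Peak and bandwidth figures are illustrative round numbers, not a real chip.

PEAK_TFLOPS = 100.0   # flat roof: compute limit, TFLOPs/s
BANDWIDTH_TBS = 1.5   # slanted roof: memory limit, TB/s

def attainable_tflops(intensity: float) -> float:
    """min(peak compute, bandwidth * arithmetic intensity)."""
    return min(PEAK_TFLOPS, BANDWIDTH_TBS * intensity)

ridge = PEAK_TFLOPS / BANDWIDTH_TBS   # intensity where the two roofs meet
print(ridge)                          # ~66.7 FLOPs/byte: right of this, compute-bound
print(attainable_tflops(0.1))         # deep in memory-bound territory
print(attainable_tflops(200.0))       # pinned at the flat roof
```

The ridge point (peak / bandwidth) is worth memorizing: it is the minimum reuse a workload needs before extra FLOPs in the hardware can pay off at all.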


1.5 Why Matrix Multiply Is Special

Matrix multiplication sits far to the right:

  • Each input element is reused many times (roughly n reuses for n×n matrices)

  • Data movement is amortized over many operations

  • Arithmetic intensity grows with matrix size

This is why:

  • GPUs shine on GEMM

  • TPUs are built around GEMM

  • Deep learning maps everything to GEMM

The Roofline doesn’t prefer matrix multiply — physics does.


1.6 Why Most Code Is Memory-Bound

Many operations:

  • Touch data once

  • Do little math

  • Move on

Examples:

  • Elementwise ops

  • Reductions

  • Poorly tiled kernels

These sit on the left side of the roofline.

No amount of extra compute fixes this.
Only more reuse does.
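Kernel fusion is the classic way to buy reuse without changing the math. A rough byte-count sketch (float32, assuming each unfused op makes a full round trip through memory, which is the usual worst case):

```python
# Fusing a chain of elementwise ops raises intensity by cutting memory round trips.
# Assumes float32 and no caching between separate kernels; a sketch only.

N = 1_000_000  # elements in the array

def chain_intensity(n_ops: int, fused: bool) -> float:
    """Arithmetic intensity of n_ops elementwise ops applied in sequence."""
    flops = n_ops * N
    if fused:
        bytes_moved = 2 * N * 4           # read input once, write output once
    else:
        bytes_moved = n_ops * 2 * N * 4   # every op reads and rewrites the array
    return flops / bytes_moved

print(chain_intensity(4, fused=False))  # stuck at 0.125 FLOPs/byte regardless of chain length
print(chain_intensity(4, fused=True))   # 0.5 FLOPs/byte, and it grows with the chain
```

The unfused chain never leaves the left side of the roofline; the fused one moves right with every op you add to the chain. This is exactly why deep learning compilers fuse elementwise ops so aggressively.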


1.7 How Hardware Designers Use the Roofline

Hardware architects ask:

  • Where will real workloads land on the graph?

  • How much reuse does a workload need to escape the memory-bound region?

  • How expensive is data movement relative to compute?

This leads directly to:

  • Large on-chip SRAMs

  • High-bandwidth memory (HBM)

  • Systolic arrays

  • Massive threading

The roofline explains why these features exist.


1.8 GPUs Through the Roofline Lens

GPUs assume:

  • Many workloads are memory-bound

  • Latency is unavoidable

So they:

  • Run thousands of threads

  • Switch work while waiting for data

  • Use caches and shared memory

GPUs hide memory latency.


1.9 TPUs Through the Roofline Lens

TPUs assume:

  • Data movement must be minimized

  • Workloads are predictable

So they:

  • Use fixed dataflow

  • Keep data stationary

  • Maximize on-chip reuse

TPUs avoid memory latency.

Same roofline. Different strategy.


1.10 The Most Important Insight

Here is the insight to underline:

Performance improves more by increasing arithmetic intensity than by increasing peak FLOPs.

This is why:

  • Algorithm design matters

  • Compiler scheduling matters

  • Memory layout matters more than ALU count
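You can check this insight numerically with the roofline's min-of-two-limits rule. Using the same illustrative figures as before: for a memory-bound kernel, doubling peak FLOPs changes nothing, while doubling intensity doubles throughput:

```python
# For a memory-bound kernel, extra peak FLOPs are wasted; extra reuse is not.
# All numbers are illustrative, not real hardware specs.

def attainable(peak_tflops: float, bw_tbs: float, intensity: float) -> float:
    """Roofline rule: min(compute roof, memory roof)."""
    return min(peak_tflops, bw_tbs * intensity)

base       = attainable(100.0, 1.5, 1.0)  # 1.5 TFLOPs/s, memory-bound
more_flops = attainable(200.0, 1.5, 1.0)  # still 1.5: doubling peak did nothing
more_reuse = attainable(100.0, 1.5, 2.0)  # 3.0: doubling reuse doubled throughput

print(base, more_flops, more_reuse)
```

A bigger ALU array only raises the flat roof, and a memory-bound workload never touches the flat roof.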


Chapter 1 Takeaway

If you remember one thing:

The Roofline model tells you why performance stops scaling — and what kind of hardware can fix it.

Before asking:

  • “Is this GPU fast enough?”

Ask:

  • “Where does my workload sit on the roofline?”


For Readers Who Want to Go Deeper 🔍

🟢 Conceptual

  • Hennessy & Patterson — Computer Architecture: A Quantitative Approach (Roofline model)

  • Williams et al. — Roofline: An Insightful Visual Performance Model for Multicore Architectures

🟡 Architecture-Level

  • NVIDIA CUDA Programming Guide (memory hierarchy)

  • Jouppi et al. — In-Datacenter Performance Analysis of a Tensor Processing Unit

🔴 Hardware / Circuit-Level

  • Rabaey — Digital Integrated Circuits

  • Chandrakasan & Brodersen — Low-Power Digital CMOS Design


