Chapter 5
Scaling One Chip: Extracting Performance from a Fixed Budget
The Core Question
If compute is cheap and data is expensive, how do we fully utilize a single accelerator?
This chapter explains why most kernels underperform, and why performance tuning is really data movement engineering.
5.1 The Real Performance Goal (Reframed)
Performance tuning is often described as:
- “Optimizing code”
- “Using faster instructions”
- “Increasing occupancy”
But from Chapters 0–2, we know the truth:
Performance improves when arithmetic intensity increases.
Every single-chip optimization tries to:
- Reduce off-chip memory accesses
- Increase reuse of on-chip data
- Convert memory stalls into useful work
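As a quick reminder of the quantity behind that claim (notation mine, summarizing the Roofline model rather than quoting the chapter):

$$
\mathrm{AI}=\frac{\text{FLOPs performed}}{\text{bytes moved to/from DRAM}},\qquad
\text{attainable FLOP/s}\approx\min\bigl(\text{peak FLOP/s},\ \mathrm{AI}\times\text{DRAM bandwidth}\bigr)
$$

Every technique in this chapter is an attempt to raise AI until the min() is decided by compute rather than by bandwidth.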
5.2 Tiling Is Not Optional — It Is the Algorithm
Consider naive matrix multiplication:
- Each element is loaded repeatedly from DRAM
- Arithmetic intensity is low
- Performance is memory-bound
Tiling changes the algorithm, not just the performance.
By dividing matrices into tiles that fit in:
- Registers
- Shared memory (GPU)
- SRAM (TPU)
you:
- Load each tile once
- Reuse it many times
- Pay the DRAM cost only once
An untiled GEMM is not a “slow version” of GEMM; it is the wrong algorithm.
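To make that concrete, here is a minimal sketch of a shared-memory tiled GEMM (the kernel name and sizes are illustrative; it assumes square, row-major matrices with N divisible by the tile size):

```cuda
#define TILE 32

// C = A * B for square, row-major N x N matrices (N assumed divisible by TILE).
// Launch: dim3 grid(N / TILE, N / TILE), block(TILE, TILE).
__global__ void tiled_gemm(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each element is loaded from DRAM once per tile step...
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // ...and then reused TILE times out of shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Relative to the naive kernel, DRAM traffic drops by roughly a factor of TILE, which is exactly the arithmetic-intensity increase the Roofline asks for.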
This is why:
- Libraries matter more than compilers
- Hardware exposes scratchpads
- Tile size is architecture-specific
5.3 On-Chip Memory Is the Real Accelerator
On modern accelerators:
- Registers and SRAM dominate silicon area
- ALUs are comparatively cheap
- Memory bandwidth limits utilization
GPU perspective:
- Registers: private, fastest
- Shared memory: programmer-controlled reuse
- Caches: opportunistic
TPU perspective:
- Large explicit SRAM
- Compiler-managed
- No speculation
The fastest MAC is the one that never needs to fetch data again.
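A small sketch of how those GPU tiers appear in code (a standard block-wide reduction; the kernel name is mine, and blockDim.x is assumed to be a power of two):

```cuda
// Each thread holds its element in a register; partial sums are then combined
// through shared memory, so every input element is read from DRAM exactly once.
// Launch with dynamic shared memory of blockDim.x * sizeof(float) bytes.
__global__ void block_sum(const float* in, float* out, int n) {
    extern __shared__ float smem[];                  // shared memory: explicit, per-block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r = (i < n) ? in[i] : 0.0f;                // register: private, fastest tier
    smem[threadIdx.x] = r;
    __syncthreads();
    // Tree reduction entirely on chip (L1/L2 caches sit below, hardware-managed).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = smem[0]; // one DRAM write per block
}
```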
5.4 Operator Fusion: Reducing Round Trips
Many DL graphs contain chains of small steps, for example matmul → bias add → activation → scale.
Naively:
- Each step writes its result to memory
- The next step reads it back
Fusion:
- Keeps intermediates on chip
- Eliminates round trips
- Increases arithmetic intensity
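As a sketch (the op chain and kernel name here are assumptions, not taken from the chapter), a fused bias + ReLU + scale kernel keeps the intermediate values in registers instead of making three separate DRAM round trips:

```cuda
// One kernel instead of three: the results of the bias add and the ReLU
// never leave registers, so only the final value is written to DRAM.
// Launch with one thread per element of the rows x cols output.
__global__ void fused_bias_relu_scale(const float* __restrict__ x,
                                      const float* __restrict__ bias,
                                      float scale, float* __restrict__ out,
                                      int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows * cols) {
        float v = x[i] + bias[i % cols];   // step 1: bias add (register)
        v = fmaxf(v, 0.0f);                // step 2: ReLU (register)
        out[i] = v * scale;                // step 3: scale, single DRAM write
    }
}
```

Three memory-bound kernels become one, and the traffic drops from roughly six passes over the tensor (three writes, three reads) to two.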
This is why:
- XLA exists
- CUDA Graphs exist
- Compiler sophistication matters
5.5 Recompute vs Store: Spending Compute to Save Bandwidth
A counterintuitive idea:
Sometimes recomputing values is cheaper than storing them.
Why?
- Storing requires memory writes
- Reloading costs bandwidth and energy
- Recomputing costs only cheap MACs
This tradeoff:
- Appears in activation checkpointing
- Is essential in large models
- Is invisible if you think only in FLOPs
This principle only makes sense once you accept Chapter 0.
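A small illustration at the kernel level (hypothetical kernel; the op is ReLU's backward pass): rather than storing a separate mask tensor in the forward pass and reloading it later, the backward kernel recomputes the mask from the saved input, trading one comparison per element for an entire tensor's worth of DRAM traffic.

```cuda
// Backward of y = relu(x): recompute the mask from x instead of loading a stored one.
// One cheap comparison per element replaces one stored-and-reloaded mask element.
__global__ void relu_backward_recompute(const float* __restrict__ x,
                                        const float* __restrict__ dy,
                                        float* __restrict__ dx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dx[i] = (x[i] > 0.0f) ? dy[i] : 0.0f;
}
```

Activation checkpointing applies the same arithmetic at layer granularity: drop intermediate activations in the forward pass and recompute them during the backward pass.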
5.6 Batch Size, Utilization, and Latency
Batch size:
- Increases reuse
- Improves utilization
- Moves the workload to the right on the Roofline plot
But:
- Inference latency suffers
- Memory pressure increases
This creates a throughput–latency tradeoff:
- Training favors large batches
- Real-time inference favors small batches
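A hedged back-of-the-envelope for the reuse claim (an fp16 dense layer $y = xW$ with $W \in \mathbb{R}^{d \times d}$ and batch size $B$; the model is illustrative, not from the chapter):

$$
\mathrm{AI}(B)\approx\frac{2Bd^{2}\ \text{FLOPs}}{2\,(d^{2}+2Bd)\ \text{bytes}}=\frac{Bd}{d+2B}\ \text{FLOPs/byte}
$$

At $B = 1$ this is under one FLOP per byte, deep in the memory-bound region; each weight is reused $B$ times, so intensity grows roughly linearly with $B$ before saturating near $d/2$. That reuse is exactly what gets traded against latency.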
There is no free solution — only informed tradeoffs.
Chapter 5 Takeaway
Single-chip performance comes from structuring computation to maximize on-chip reuse — not from adding more compute.
Deep References 🔍
🟢 Conceptual
- Hennessy & Patterson, Roofline and locality
- Sze et al., Efficient Processing of Deep Neural Networks
🟡 Architecture
- NVIDIA CUDA Optimization Guide
- XLA compiler docs
🔴 Hardware
- SRAM energy models (ISSCC)
- Scratchpad vs. cache design papers