Monday, December 22, 2025

Chapter 5

Scaling One Chip: Extracting Performance from a Fixed Budget

The Core Question

If compute is cheap and data is expensive, how do we fully utilize a single accelerator?

This chapter explains why most kernels underperform, and why performance tuning is really data movement engineering.


5.1 The Real Performance Goal (Reframed)

Performance tuning is often described as:

  • “Optimizing code”

  • “Using faster instructions”

  • “Increasing occupancy”

But from Chapters 0–2, we know the truth:

Performance improves when arithmetic intensity increases, that is, when more FLOPs are performed per byte moved from off-chip memory.

Every single-chip optimization tries to:

  • Reduce off-chip memory accesses

  • Increase reuse of on-chip data

  • Convert memory stalls into useful work
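To put a number on this, here is a minimal sketch in plain Python that estimates the arithmetic intensity of a GEMM and compares it against a machine balance point. The peak-FLOP and bandwidth figures are illustrative assumptions, not any specific chip.

  # Arithmetic intensity of C = A @ B with A: (M, K), B: (K, N), FP16 (2 bytes/element).
  # Hardware numbers below are illustrative assumptions, not a specific accelerator.

  def gemm_arithmetic_intensity(M, K, N, bytes_per_elem=2):
      flops = 2 * M * K * N                                    # one multiply + one add per MAC
      bytes_moved = bytes_per_elem * (M * K + K * N + M * N)   # read A and B, write C (ideal reuse)
      return flops / bytes_moved

  peak_flops = 300e12    # assumed peak compute: 300 TFLOP/s
  peak_bw = 2e12         # assumed HBM bandwidth: 2 TB/s
  machine_balance = peak_flops / peak_bw   # FLOPs needed per byte to stay compute-bound

  ai = gemm_arithmetic_intensity(M=4096, K=4096, N=4096)
  print(f"arithmetic intensity: {ai:.0f} FLOPs/byte, machine balance: {machine_balance:.0f}")
  # If ai < machine_balance, the kernel is memory-bound no matter how fast the ALUs are.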


5.2 Tiling Is Not Optional — It Is the Algorithm

Consider naive matrix multiplication:

  • Each element is loaded repeatedly from DRAM

  • Arithmetic intensity is low

  • Performance is memory-bound

Tiling changes the algorithm, not just its performance.

By dividing matrices into tiles that fit in:

  • Registers

  • Shared memory (GPU)

  • SRAM (TPU)

You:

  • Load each tile once

  • Reuse it many times

  • Pay the DRAM cost only once

An untiled GEMM is not a “slow version” of GEMM — it is the wrong algorithm.
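Here is a minimal sketch of the idea in plain Python/NumPy. The explicit a_tile and b_tile copies stand in for registers or shared memory, and the tile size is a placeholder that on real hardware is chosen to fit the scratchpad.

  import numpy as np

  def tiled_matmul(A, B, tile=64):
      # Illustrative tiling sketch: a_tile / b_tile play the role of on-chip buffers.
      # Each tile is loaded from the big "DRAM" arrays once and reused across a whole
      # block of the output.
      M, K = A.shape
      K2, N = B.shape
      assert K == K2
      C = np.zeros((M, N), dtype=A.dtype)
      for i in range(0, M, tile):
          for j in range(0, N, tile):
              acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
              for k in range(0, K, tile):
                  a_tile = A[i:i + tile, k:k + tile]   # one "DRAM" load, many uses
                  b_tile = B[k:k + tile, j:j + tile]
                  acc += a_tile @ b_tile               # reuse happens inside this product
              C[i:i + tile, j:j + tile] = acc
      return C

  A = np.random.rand(256, 256).astype(np.float32)
  B = np.random.rand(256, 256).astype(np.float32)
  assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)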

This is why:

  • Libraries matter more than compilers

  • Hardware exposes scratchpads

  • Tile size is architecture-specific


5.3 On-Chip Memory Is the Real Accelerator

On modern accelerators:

  • Registers and SRAM dominate silicon area

  • ALUs are comparatively cheap

  • Memory bandwidth limits utilization

GPU perspective

  • Registers: private, fastest

  • Shared memory: programmer-controlled reuse

  • Caches: opportunistic

TPU perspective

  • Large explicit SRAM

  • Compiler-managed

  • No speculation

The fastest MAC is the one that never needs to fetch data again.
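A back-of-the-envelope sketch makes the point. The per-access energies below are assumed, order-of-magnitude values in the spirit of published ISSCC-style surveys, not measurements of any particular chip; only the ratios matter.

  # Back-of-the-envelope energy comparison (assumed, order-of-magnitude values).

  ENERGY_PJ = {          # assumed energy per 32-bit access/op, in picojoules
      "mac": 1,          # on-chip multiply-accumulate
      "sram": 10,        # on-chip SRAM / shared-memory read
      "dram": 500,       # off-chip DRAM read
  }

  def energy_per_mac(reuse, source="dram"):
      # Energy per MAC when each operand fetched from `source` is reused `reuse` times.
      return ENERGY_PJ["mac"] + ENERGY_PJ[source] / reuse

  for reuse in (1, 8, 64, 512):
      print(f"reuse={reuse:4d}: {energy_per_mac(reuse):7.2f} pJ/MAC")
  # With reuse=1 the DRAM fetch dominates; with high on-chip reuse the MAC itself
  # becomes the main cost, which is exactly what tiling and fusion are buying.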


5.4 Operator Fusion: Reducing Round Trips

Many DL graphs look like:

GEMM → activation → normalization → elementwise ops

Naively:

  • Each step writes its result to off-chip memory

  • The next step reads it back

Fusion:

  • Keeps intermediates on chip

  • Eliminates round trips

  • Increases arithmetic intensity

This is why:

  • XLA exists

  • CUDA graphs exist

  • Compiler sophistication matters
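A minimal sketch using JAX: decorating the whole block with jax.jit hands it to XLA, which can fuse the activation, normalization, and elementwise ops with the GEMM's output instead of round-tripping each intermediate through off-chip memory. The shapes and the simple layer norm here are placeholders.

  import jax
  import jax.numpy as jnp

  @jax.jit  # XLA compiles the whole function; elementwise ops can fuse around the GEMM
  def fused_layer(x, w, gamma, beta, eps=1e-5):
      y = x @ w                                  # GEMM
      y = jax.nn.gelu(y)                         # activation
      mean = y.mean(axis=-1, keepdims=True)      # simple layer norm
      var = y.var(axis=-1, keepdims=True)
      y = (y - mean) / jnp.sqrt(var + eps)
      return y * gamma + beta                    # elementwise scale and shift

  k1, k2 = jax.random.split(jax.random.PRNGKey(0))
  x = jax.random.normal(k1, (128, 512))
  w = jax.random.normal(k2, (512, 1024)) * 0.02
  gamma = jnp.ones((1024,))
  beta = jnp.zeros((1024,))
  out = fused_layer(x, w, gamma, beta)  # compiled as one graph; intermediates stay on chip where possible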


5.5 Recompute vs Store: Spending Compute to Save Bandwidth

A counterintuitive idea:

Sometimes recomputing values is cheaper than storing them.

Why?

  • Storing requires memory writes

  • Reloading costs bandwidth and energy

  • Recompute costs cheap MACs

This tradeoff:

  • Appears in activation checkpointing

  • Is essential in large models

  • Is invisible if you think only in FLOPs

This principle only makes sense once you accept Chapter 0.
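A minimal sketch of this knob using JAX's jax.checkpoint (also exposed as jax.remat): wrapping a block tells autodiff not to store its intermediates for the backward pass and to recompute them instead. The tiny MLP block is a placeholder for a real transformer layer.

  import jax
  import jax.numpy as jnp

  def block(x, w1, w2):
      # Small MLP block standing in for a transformer layer.
      h = jax.nn.relu(x @ w1)
      return h @ w2

  # jax.checkpoint (a.k.a. jax.remat): do not store `h` for the backward pass;
  # recompute it during backprop, trading extra MACs for less memory traffic.
  rematted_block = jax.checkpoint(block)

  def loss(params, x):
      w1, w2 = params
      y = rematted_block(x, w1, w2)
      return jnp.sum(y ** 2)

  k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
  x = jax.random.normal(k1, (64, 256))
  params = (jax.random.normal(k2, (256, 1024)) * 0.02,
            jax.random.normal(k3, (1024, 256)) * 0.02)
  grads = jax.grad(loss)(params, x)   # backward pass recomputes `h` instead of reloading it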


5.6 Batch Size, Utilization, and Latency

Increasing the batch size:

  • Increases weight reuse

  • Improves utilization

  • Moves the workload to the right on the Roofline

But:

  • Inference latency suffers

  • Memory pressure increases

This creates a throughput–latency tradeoff:

  • Training favors large batches

  • Real-time inference favors small batches
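To see the throughput side concretely, here is a quick, illustrative calculation: for a single linear layer the weights are loaded once regardless of batch size, so FLOPs per byte grow with the batch. The layer shape and the FP16 assumption are placeholders.

  # Arithmetic intensity of y = x @ W for one linear layer,
  # x: (batch, d_in), W: (d_in, d_out), FP16 (2 bytes/element).

  def layer_arithmetic_intensity(batch, d_in, d_out, bytes_per_elem=2):
      flops = 2 * batch * d_in * d_out
      bytes_moved = bytes_per_elem * (batch * d_in      # read activations
                                      + d_in * d_out    # read weights (once, reused across the batch)
                                      + batch * d_out)  # write outputs
      return flops / bytes_moved

  for batch in (1, 8, 64, 512):
      ai = layer_arithmetic_intensity(batch, d_in=4096, d_out=4096)
      print(f"batch={batch:4d}: {ai:6.1f} FLOPs/byte")
  # batch=1 behaves like a matrix-vector product (memory-bound);
  # large batches amortize the weight load and move the layer toward compute-bound.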

There is no free solution — only informed tradeoffs.


Chapter 5 Takeaway

Single-chip performance comes from structuring computation to maximize on-chip reuse — not from adding more compute.


Deep References 🔍

🟢 Conceptual

  • Hennessy & Patterson — Roofline & locality

  • Sze et al. — Efficient Processing of Deep Neural Networks

🟡 Architecture

  • NVIDIA CUDA Optimization Guide

  • XLA compiler docs

🔴 Hardware

  • SRAM energy models (ISSCC)

  • Scratchpad vs cache design papers

