Monday, December 22, 2025

Chapter 5

Scaling One Chip: Extracting Performance from a Fixed Budget

The Core Question

If compute is cheap and data is expensive, how do we fully utilize a single accelerator?

This chapter explains why most kernels underperform, and why performance tuning is really data movement engineering.


5.1 The Real Performance Goal (Reframed)

Performance tuning is often described as:

  • “Optimizing code”

  • “Using faster instructions”

  • “Increasing occupancy”

But from Chapters 0–2, we know the truth:

Performance improves when arithmetic intensity increases, that is, when more FLOPs are performed per byte moved from off-chip memory.

Every single-chip optimization tries to:

  • Reduce off-chip memory accesses

  • Increase reuse of on-chip data

  • Convert memory stalls into useful work
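To put a number on this, here is a minimal sketch in plain Python that estimates the arithmetic intensity of a GEMM and compares it against a machine balance point. The peak-FLOP and bandwidth figures are illustrative assumptions, not any specific chip.

  # Arithmetic intensity of C = A @ B with A: (M, K), B: (K, N), FP16 (2 bytes/element).
  # Hardware numbers below are illustrative assumptions, not a specific accelerator.

  def gemm_arithmetic_intensity(M, K, N, bytes_per_elem=2):
      flops = 2 * M * K * N                                    # one multiply + one add per MAC
      bytes_moved = bytes_per_elem * (M * K + K * N + M * N)   # read A and B, write C (ideal reuse)
      return flops / bytes_moved

  peak_flops = 300e12    # assumed peak compute: 300 TFLOP/s
  peak_bw = 2e12         # assumed HBM bandwidth: 2 TB/s
  machine_balance = peak_flops / peak_bw   # FLOPs needed per byte to stay compute-bound

  ai = gemm_arithmetic_intensity(M=4096, K=4096, N=4096)
  print(f"arithmetic intensity: {ai:.0f} FLOPs/byte, machine balance: {machine_balance:.0f}")
  # If ai < machine_balance, the kernel is memory-bound no matter how fast the ALUs are.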


5.2 Tiling Is Not Optional — It Is the Algorithm

Consider naive matrix multiplication:

  • Each element is loaded repeatedly from DRAM

  • Arithmetic intensity is low

  • Performance is memory-bound

Tiling changes the algorithm, not just its performance.

By dividing matrices into tiles that fit in:

  • Registers

  • Shared memory (GPU)

  • SRAM (TPU)

You:

  • Load each tile once

  • Reuse it many times

  • Pay the DRAM cost only once

An untiled GEMM is not a “slow version” of GEMM — it is the wrong algorithm.
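Here is a minimal sketch of the idea in plain Python/NumPy. The explicit a_tile and b_tile copies stand in for registers or shared memory, and the tile size is a placeholder that on real hardware is chosen to fit the scratchpad.

  import numpy as np

  def tiled_matmul(A, B, tile=64):
      # Illustrative tiling sketch: a_tile / b_tile play the role of on-chip buffers.
      # Each tile is loaded from the big "DRAM" arrays once and reused across a whole
      # block of the output.
      M, K = A.shape
      K2, N = B.shape
      assert K == K2
      C = np.zeros((M, N), dtype=A.dtype)
      for i in range(0, M, tile):
          for j in range(0, N, tile):
              acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
              for k in range(0, K, tile):
                  a_tile = A[i:i + tile, k:k + tile]   # one "DRAM" load, many uses
                  b_tile = B[k:k + tile, j:j + tile]
                  acc += a_tile @ b_tile               # reuse happens inside this product
              C[i:i + tile, j:j + tile] = acc
      return C

  A = np.random.rand(256, 256).astype(np.float32)
  B = np.random.rand(256, 256).astype(np.float32)
  assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)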

This is why:

  • Libraries matter more than compilers

  • Hardware exposes scratchpads

  • Tile size is architecture-specific


5.3 On-Chip Memory Is the Real Accelerator

On modern accelerators:

  • Registers and SRAM dominate silicon area

  • ALUs are comparatively cheap

  • Memory bandwidth limits utilization

GPU perspective

  • Registers: private, fastest

  • Shared memory: programmer-controlled reuse

  • Caches: opportunistic

TPU perspective

  • Large explicit SRAM

  • Compiler-managed

  • No speculation

The fastest MAC is the one that never needs to fetch data again.
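A back-of-the-envelope sketch makes the point. The per-access energies below are assumed, order-of-magnitude values in the spirit of published ISSCC-style surveys, not measurements of any particular chip; only the ratios matter.

  # Back-of-the-envelope energy comparison (assumed, order-of-magnitude values).

  ENERGY_PJ = {          # assumed energy per 32-bit access/op, in picojoules
      "mac": 1,          # on-chip multiply-accumulate
      "sram": 10,        # on-chip SRAM / shared-memory read
      "dram": 500,       # off-chip DRAM read
  }

  def energy_per_mac(reuse, source="dram"):
      # Energy per MAC when each operand fetched from `source` is reused `reuse` times.
      return ENERGY_PJ["mac"] + ENERGY_PJ[source] / reuse

  for reuse in (1, 8, 64, 512):
      print(f"reuse={reuse:4d}: {energy_per_mac(reuse):7.2f} pJ/MAC")
  # With reuse=1 the DRAM fetch dominates; with high on-chip reuse the MAC itself
  # becomes the main cost, which is exactly what tiling and fusion are buying.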


5.4 Operator Fusion: Reducing Round Trips

Many DL graphs look like:

GEMM → activation → normalization → elementwise ops

Naively:

  • Each step writes its result to off-chip memory

  • The next step reads it back

Fusion:

  • Keeps intermediates on chip

  • Eliminates round trips

  • Increases arithmetic intensity

This is why:

  • XLA exists

  • CUDA graphs exist

  • Compiler sophistication matters
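A minimal sketch using JAX: decorating the whole block with jax.jit hands it to XLA, which can fuse the activation, normalization, and elementwise ops with the GEMM's output instead of round-tripping each intermediate through off-chip memory. The shapes and the simple layer norm here are placeholders.

  import jax
  import jax.numpy as jnp

  @jax.jit  # XLA compiles the whole function; elementwise ops can fuse around the GEMM
  def fused_layer(x, w, gamma, beta, eps=1e-5):
      y = x @ w                                  # GEMM
      y = jax.nn.gelu(y)                         # activation
      mean = y.mean(axis=-1, keepdims=True)      # simple layer norm
      var = y.var(axis=-1, keepdims=True)
      y = (y - mean) / jnp.sqrt(var + eps)
      return y * gamma + beta                    # elementwise scale and shift

  k1, k2 = jax.random.split(jax.random.PRNGKey(0))
  x = jax.random.normal(k1, (128, 512))
  w = jax.random.normal(k2, (512, 1024)) * 0.02
  gamma = jnp.ones((1024,))
  beta = jnp.zeros((1024,))
  out = fused_layer(x, w, gamma, beta)  # compiled as one graph; intermediates stay on chip where possible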


5.5 Recompute vs Store: Spending Compute to Save Bandwidth

A counterintuitive idea:

Sometimes recomputing values is cheaper than storing them.

Why?

  • Storing requires memory writes

  • Reloading costs bandwidth and energy

  • Recompute costs cheap MACs

This tradeoff:

  • Appears in activation checkpointing

  • Is essential in large models

  • Is invisible if you think only in FLOPs

This principle only makes sense once you accept Chapter 0.
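A minimal sketch of this knob using JAX's jax.checkpoint (also exposed as jax.remat): wrapping a block tells autodiff not to store its intermediates for the backward pass and to recompute them instead. The tiny MLP block is a placeholder for a real transformer layer.

  import jax
  import jax.numpy as jnp

  def block(x, w1, w2):
      # Small MLP block standing in for a transformer layer.
      h = jax.nn.relu(x @ w1)
      return h @ w2

  # jax.checkpoint (a.k.a. jax.remat): do not store `h` for the backward pass;
  # recompute it during backprop, trading extra MACs for less memory traffic.
  rematted_block = jax.checkpoint(block)

  def loss(params, x):
      w1, w2 = params
      y = rematted_block(x, w1, w2)
      return jnp.sum(y ** 2)

  k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
  x = jax.random.normal(k1, (64, 256))
  params = (jax.random.normal(k2, (256, 1024)) * 0.02,
            jax.random.normal(k3, (1024, 256)) * 0.02)
  grads = jax.grad(loss)(params, x)   # backward pass recomputes `h` instead of reloading it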


5.6 Batch Size, Utilization, and Latency

Increasing the batch size:

  • Increases weight reuse

  • Improves utilization

  • Moves the workload to the right on the Roofline

But:

  • Inference latency suffers

  • Memory pressure increases

This creates a throughput–latency tradeoff:

  • Training favors large batches

  • Real-time inference favors small batches
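To see the throughput side concretely, here is a quick, illustrative calculation: for a single linear layer the weights are loaded once regardless of batch size, so FLOPs per byte grow with the batch. The layer shape and the FP16 assumption are placeholders.

  # Arithmetic intensity of y = x @ W for one linear layer,
  # x: (batch, d_in), W: (d_in, d_out), FP16 (2 bytes/element).

  def layer_arithmetic_intensity(batch, d_in, d_out, bytes_per_elem=2):
      flops = 2 * batch * d_in * d_out
      bytes_moved = bytes_per_elem * (batch * d_in      # read activations
                                      + d_in * d_out    # read weights (once, reused across the batch)
                                      + batch * d_out)  # write outputs
      return flops / bytes_moved

  for batch in (1, 8, 64, 512):
      ai = layer_arithmetic_intensity(batch, d_in=4096, d_out=4096)
      print(f"batch={batch:4d}: {ai:6.1f} FLOPs/byte")
  # batch=1 behaves like a matrix-vector product (memory-bound);
  # large batches amortize the weight load and move the layer toward compute-bound.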

There is no free solution — only informed tradeoffs.


Chapter 5 Takeaway

Single-chip performance comes from structuring computation to maximize on-chip reuse — not from adding more compute.


Deep References 🔍

🟢 Conceptual

  • Hennessy & Patterson — Roofline & locality

  • Sze et al. — Efficient Processing of Deep Neural Networks

🟡 Architecture

  • NVIDIA CUDA Optimization Guide

  • XLA compiler docs

🔴 Hardware

  • SRAM energy models (ISSCC)

  • Scratchpad vs cache design papers

