Monday, December 22, 2025

Chapter 3

GPUs: Latency-Hiding Machines

The Promise of This Chapter

After this chapter, readers will understand:

  • Why GPUs use thousands of threads

  • Why SIMT exists

  • Why GPUs look “messy” compared to TPUs

  • When GPUs shine — and when they don’t

Most importantly, GPUs will stop feeling ad hoc.


3.1 The Core GPU Assumption

GPUs are built around one brutal assumption:

Memory latency is unavoidable.

Accessing DRAM takes:

  • Hundreds of cycles

  • Orders of magnitude longer than a single multiply-accumulate (MAC)

GPUs don’t try to eliminate this latency.
They try to hide it.


3.2 Latency Hiding vs Latency Avoidance

There are two ways to deal with slow memory:

Strategy 1: Avoid It (TPU-style)

  • Keep data on chip

  • Control dataflow tightly

  • Require predictability

Strategy 2: Hide It (GPU-style)

  • Switch to other work while waiting

  • Assume irregular access

  • Rely on massive parallelism

GPUs choose Strategy 2.


3.3 The GPU Mental Model

Think of a GPU as:

A machine that runs so many threads that some are always ready to compute.

When one thread stalls:

  • Another runs

  • And another

  • And another

As long as some threads are runnable, compute units stay busy.
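In CUDA this mental model shows up directly in how kernels are written: you launch far more threads than the hardware has arithmetic units. Below is a minimal grid-stride-loop sketch (the kernel name, array size, and launch configuration are illustrative); the hardware keeps a large pool of these threads resident and runs whichever ones are not waiting on memory.

#include <cuda_runtime.h>

// Minimal grid-stride vector update. We deliberately create many more threads
// than the GPU has ALUs; the schedulers rotate among them, so threads stalled
// on global-memory loads do not leave the compute units idle.
__global__ void axpy(int n, float a, const float *x, float *y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {   // stride over the whole array
        y[i] = a * x[i] + y[i];           // one memory-limited MAC per element
    }
}

// Illustrative launch: 1024 blocks of 256 threads = 262,144 threads in flight.
// axpy<<<1024, 256>>>(n, 2.0f, d_x, d_y);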


3.4 SIMT: One Instruction, Many Threads

GPUs use SIMT (Single Instruction, Multiple Threads).

  • Threads are grouped into warps

  • A warp executes one instruction at a time

  • Each thread has its own registers and data

This gives:

  • SIMD-like efficiency

  • Thread-level flexibility

But it also introduces:

  • Divergence penalties

  • Control-flow sensitivity
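Here is a minimal sketch of where the divergence penalty comes from (the kernel is made up purely for illustration): when threads of the same warp take different branches, the warp executes both paths one after the other, with the inactive threads masked off.

// Even-numbered lanes of each warp take the "if" path, odd-numbered lanes the "else".
// Because the warp issues one instruction for all 32 threads, the two paths are
// serialized: first the "if" body runs with half the lanes masked off, then the
// "else" body runs with the other half masked off.
__global__ void divergent(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        out[i] = in[i] * 2.0f;   // path A
    } else {
        out[i] = in[i] + 1.0f;   // path B
    }
    // If the condition were uniform across each warp (e.g. based on blockIdx.x),
    // there would be no divergence and no penalty.
}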


3.5 Why GPUs Need So Many Threads

Let’s do the math.

If:

  • Memory latency ≈ 400 cycles

  • Each warp scheduler can issue roughly one instruction per cycle

Then to hide latency:

  • You need hundreds of warps

  • Ready to swap in instantly
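Concretely, with these illustrative round numbers (not any particular GPU's spec): 400 cycles of latency times roughly one instruction issued per cycle means on the order of 400 independent instructions must be in flight per scheduler just to keep it busy. If each warp contributes only a handful of independent instructions before it stalls on memory, that works out to dozens to hundreds of resident warps per SM.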

This is why GPUs:

  • Have enormous register files

  • Support tens of thousands of threads

  • Look absurd compared to CPUs

This is not excess.
It’s necessary.


3.6 The Streaming Multiprocessor (SM)

Each GPU is made of many Streaming Multiprocessors (SMs).

An SM contains:

  • ALUs

  • Tensor cores

  • Registers

  • Shared memory

  • Warp schedulers

Think of an SM as:

A latency-hiding engine with local scratchpad memory.
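These per-SM resources are visible to the programmer through the CUDA runtime. A small query sketch using cudaGetDeviceProperties (the printed values depend entirely on the GPU generation):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("SMs on this GPU:        %d\n",  prop.multiProcessorCount);
    printf("Max threads per SM:     %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("Registers per SM:       %d\n",  prop.regsPerMultiprocessor);
    printf("Shared memory per SM:   %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Warp size:              %d\n",  prop.warpSize);
    return 0;
}

The register and thread counts are what make latency hiding possible: every resident thread's registers live in the SM at once, which is exactly why the register files are so large.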


3.7 Memory Hierarchy: Managed Chaos

GPUs expose multiple memory levels:

  • Registers (private, fast)

  • Shared memory (programmer-managed)

  • L1 cache

  • L2 cache

  • Global memory (DRAM)

Why so many?

Because GPUs assume:

  • Some reuse exists

  • But not enough to fully control statically

So they:

  • Cache opportunistically

  • Let programmers manage locality when possible
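"Let programmers manage locality" means, in practice, staging tiles through shared memory by hand. The classic example is a tiled matrix multiply; this is a standard sketch assuming square row-major matrices whose dimension is a multiple of the tile size:

#define TILE 32

// C = A * B for N x N row-major matrices, with N a multiple of TILE.
// Each block stages a TILE x TILE tile of A and of B into shared memory,
// so every element fetched from DRAM is reused TILE times on chip.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative, coalesced loads: one element per thread.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // All MACs for this tile read shared memory, not DRAM.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Illustrative launch for N x N matrices:
// dim3 block(TILE, TILE);
// dim3 grid(N / TILE, N / TILE);
// matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, N);

Without the shared-memory staging, the inner loop would issue the same arithmetic but pull every operand from DRAM, leaving the kernel firmly on the memory-bound side of the Roofline.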


3.8 Tensor Cores: Compute Is Cheap

Tensor cores are a direct consequence of Chapter 0.

Because compute is cheap:

  • GPUs add specialized MAC units

  • Support FP16, BF16, INT8

  • Trade precision for throughput

Tensor cores:

  • Do not fix memory problems

  • Only help once data is fed efficiently

They amplify — not replace — good data movement.
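For the curious, this is roughly what the warp-level tensor-core interface looks like in CUDA (the nvcuda::wmma API; 16x16x16 is one of the supported fragment shapes, FP16 inputs with FP32 accumulation, and the pointers are assumed to hold correctly laid-out tiles):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile: acc += A * B, FP16 in, FP32 out.
__global__ void wmma_tile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);      // still ordinary memory loads
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);        // the tensor-core MAC
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}

Note that only mma_sync touches the tensor cores; the load_matrix_sync calls are ordinary memory traffic, which is why tensor cores amplify, rather than replace, good data movement.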


3.9 GPUs on the Roofline

On the Roofline graph:

  • GPUs raise the compute roof

  • Improve effective bandwidth via caching

  • Push workloads to the right (toward higher arithmetic intensity) via tiling

But:

  • Low-intensity workloads stay memory-bound

  • Irregular access hurts

  • Synchronization costs add up

GPUs reward good structure — but tolerate bad structure better than TPUs.
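A rough way to see why low-intensity workloads stay memory-bound: compare arithmetic intensity to the machine's ridge point. The peak-FLOP/s and bandwidth numbers below are placeholders, not a specific GPU's datasheet.

#include <cstdio>

int main() {
    // Placeholder machine balance, for illustration only.
    double peak_flops = 100e12;                    // 100 TFLOP/s
    double bandwidth  = 2e12;                      // 2 TB/s
    double ridge      = peak_flops / bandwidth;    // FLOP/byte needed to be compute-bound

    // Vector add: 1 FLOP per 12 bytes moved (two float loads + one float store).
    double ai_vector_add = 1.0 / 12.0;

    // Tiled matmul with a 32x32 tile: each loaded element feeds ~32 MACs,
    // so on the order of 8 FLOP per byte (rough estimate, not measured).
    double ai_tiled_matmul = 8.0;

    printf("ridge point:   %5.1f FLOP/byte\n", ridge);
    printf("vector add:    %5.3f FLOP/byte -> far below the ridge, memory-bound\n", ai_vector_add);
    printf("tiled matmul:  %5.1f FLOP/byte -> much closer to the compute roof\n", ai_tiled_matmul);
    return 0;
}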


3.10 When GPUs Are the Right Tool

GPUs excel when:

  • Workloads are diverse

  • Models change frequently

  • Control flow is irregular

  • Development velocity matters

This is why GPUs dominate:

  • Research

  • Prototyping

  • Multi-tenant environments

Flexibility is the GPU’s superpower.


Chapter 3 Takeaway

If you remember one thing:

GPUs don’t make memory fast — they make waiting cheap.

Everything else follows from this.


For Readers Who Want to Go Deeper 🔍

🟢 Conceptual

  • Kirk & Hwu — Programming Massively Parallel Processors

  • NVIDIA CUDA Programming Guide

🟡 Architecture-Level

  • NVIDIA GPU architecture whitepapers (Volta → Hopper)

  • “Understanding GPU Microarchitecture” talks

🔴 Hardware / RTL-Level

  • SM scheduler design papers (ISCA / MICRO)

  • Register file & shared memory design (ISSCC)

