Chapter 3
GPUs: Latency-Hiding Machines
The Promise of This Chapter
After this chapter, readers will understand:
- Why GPUs use thousands of threads
- Why SIMT exists
- Why GPUs look “messy” compared to TPUs
- When GPUs shine — and when they don’t
Most importantly, GPUs will stop feeling ad-hoc.
3.1 The Core GPU Assumption
GPUs are built around one brutal assumption:
Memory latency is unavoidable.
Accessing DRAM takes:
- Hundreds of cycles
- Orders of magnitude more time than a MAC
GPUs don’t try to eliminate this latency.
They try to hide it.
3.2 Latency Hiding vs Latency Avoidance
There are two ways to deal with slow memory:
Strategy 1: Avoid It (TPU-style)
- Keep data on chip
- Control dataflow tightly
- Require predictability
Strategy 2: Hide It (GPU-style)
- Switch to other work while waiting
- Assume irregular access
- Rely on massive parallelism
GPUs choose Strategy 2.
3.3 The GPU Mental Model
Think of a GPU as:
A machine that runs so many threads that some are always ready to compute.
When one thread stalls:
- Another runs
- And another
- And another
As long as some threads are runnable, compute units stay busy.
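To make this concrete, here is a minimal CUDA sketch of that oversubscription: a grid-stride kernel launched with far more threads than the GPU has execution units, so the schedulers always have runnable warps while others wait on memory. The kernel name, sizes, and launch configuration are illustrative, not taken from any particular codebase.

```
// Grid-stride loop: each thread handles many elements, and the launch
// deliberately creates far more threads than there are execution units.
__global__ void scale(float *x, float a, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        x[i] = a * x[i];   // while this load is in flight, the scheduler runs other warps
    }
}

// Host side (illustrative sizes): 1024 blocks x 256 threads = 262,144 threads in flight.
// scale<<<1024, 256>>>(d_x, 2.0f, n);
```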
3.4 SIMT: One Instruction, Many Threads
GPUs use SIMT (Single Instruction, Multiple Threads).
- Threads are grouped into warps (32 threads on NVIDIA hardware)
- All threads in a warp execute the same instruction at a time
- Each thread has its own registers and data
This gives:
- SIMD-like efficiency
- Thread-level flexibility
But it also introduces:
- Divergence penalties
- Control-flow sensitivity
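A small, hypothetical kernel makes the divergence penalty visible: when even and odd lanes of the same warp take different branches, the warp executes both paths one after the other, with half of its lanes masked off each time.

```
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) {            // even lanes take this path...
        x[i] = x[i] * 2.0f;
    } else {                     // ...odd lanes take this one, so the warp
        x[i] = x[i] + 1.0f;      // runs both branches back to back
    }
    // A branch-free form keeps the warp converged (the compiler can usually
    // turn it into predicated instructions):
    //   x[i] = (i % 2 == 0) ? x[i] * 2.0f : x[i] + 1.0f;
}
```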
3.5 Why GPUs Need So Many Threads
Let’s do the math.
If:
- Memory latency ≈ 400 cycles
- Each warp executes one instruction per cycle
Then to hide latency:
- You need hundreds of warps
- Ready to swap in instantly
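Spelled out as a back-of-the-envelope calculation (the latency and instruction counts below are assumed round numbers, not measurements of any specific GPU):

```
// Rough occupancy math for one warp scheduler:
//   latency ≈ 400 cycles per global-memory load
//   issue   = 1 instruction per warp per cycle
// If a warp issues ~4 independent instructions before stalling on a load,
// the scheduler needs roughly 400 / 4 = 100 resident warps to stay busy.
constexpr int kLatencyCycles  = 400;  // assumed DRAM round trip
constexpr int kInstrsPerStall = 4;    // assumed independent work per warp
constexpr int kWarpsToHideIt  = kLatencyCycles / kInstrsPerStall;  // 100 warps
```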
This is why GPUs:
- Have enormous register files
- Support tens of thousands of threads
- Look absurd compared to CPUs
This is not excess.
It’s necessary.
3.6 The Streaming Multiprocessor (SM)
Each GPU is made of many Streaming Multiprocessors (SMs).
An SM contains:
- ALUs
- Tensor cores
- Registers
- Shared memory
- Warp schedulers
Think of an SM as:
A latency-hiding engine with local scratchpad memory.
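You can ask the CUDA runtime what an SM on your own GPU actually holds. A minimal sketch using the standard cudaGetDeviceProperties call (device 0 assumed; the printed numbers vary by GPU):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);   // query device 0
    std::printf("SMs:                %d\n", p.multiProcessorCount);
    std::printf("Registers per SM:   %d\n", p.regsPerMultiprocessor);
    std::printf("Shared mem per SM:  %zu bytes\n", p.sharedMemPerMultiprocessor);
    std::printf("Max threads per SM: %d\n", p.maxThreadsPerMultiProcessor);
    return 0;
}
```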
3.7 Memory Hierarchy: Managed Chaos
GPUs expose multiple memory levels:
- Registers (private, fast)
- Shared memory (programmer-managed)
- L1 cache
- L2 cache
- Global memory (DRAM)
Why so many?
Because GPUs assume:
- Some reuse exists
- But not enough to fully control statically
So they:
- Cache opportunistically
- Let programmers manage locality when possible
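A classic example of “manage locality when possible” is staging a tile in shared memory. The sketch below is a hypothetical 1D averaging filter: each block loads its tile from DRAM once, then every thread reuses its neighbors’ values from shared memory instead of global memory (boundary handling is simplified to zeros).

```
#define TILE 256   // launch with blockDim.x == TILE, e.g. smooth<<<(n + TILE - 1) / TILE, TILE>>>(...)

__global__ void smooth(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2];             // tile plus a 1-element halo on each side
    int g = blockIdx.x * TILE + threadIdx.x;     // global index
    int l = threadIdx.x + 1;                     // local index (slot 0 is the left halo)

    if (g < n) tile[l] = in[g];                  // one global load per element
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;                 // left halo
    if (threadIdx.x == TILE - 1)
        tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;      // right halo
    __syncthreads();                             // make the tile visible to the whole block

    if (g < n)                                   // three reads hit shared memory, not DRAM
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}
```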
3.8 Tensor Cores: Compute Is Cheap
Tensor cores are a direct consequence of Chapter 0.
Because compute is cheap:
-
GPUs add specialized MAC units
-
Support FP16, BF16, INT8
-
Trade precision for throughput
Tensor cores:
- Do not fix memory problems
- Only help once data is fed efficiently
They amplify — not replace — good data movement.
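For the curious, the warp-level WMMA API is the most direct way to touch a tensor core from CUDA C++. In the sketch below, one warp multiplies a single 16×16×16 tile in FP16 with FP32 accumulation; it needs compute capability 7.0+ and a launch of at least one warp (e.g. `tile_mma<<<1, 32>>>(d_a, d_b, d_c)`). Pointer names and tile layout are illustrative.

```
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void tile_mma(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);            // 16x16 FP16 tile of A
    wmma::load_matrix_sync(b_frag, b, 16);            // 16x16 FP16 tile of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // D = A*B + C on tensor cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

Notice that the kernel says nothing about where `a` and `b` come from: keeping the tensor cores fed is still the memory system’s job.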
3.9 GPUs on the Roofline
On the Roofline graph:
- GPUs raise the compute roof
- Improve effective bandwidth via caching
- Push workloads to the right (higher arithmetic intensity) via tiling
But:
- Low-intensity workloads stay memory-bound
- Irregular access hurts
- Synchronization costs add up
GPUs reward good structure — but tolerate bad structure better than TPUs.
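To put a number on “memory-bound”, here is the ridge-point arithmetic with assumed round values rather than datasheet figures:

```
// Roofline: attainable FLOP/s = min(peak_compute, intensity * peak_bandwidth).
constexpr double kPeakFlops     = 100e12;  // assumed peak compute, FLOP/s
constexpr double kPeakBandwidth = 2e12;    // assumed DRAM bandwidth, bytes/s
constexpr double kRidgePoint    = kPeakFlops / kPeakBandwidth;  // 50 FLOPs per byte

// A streaming kernel doing 2 FLOPs per 8 bytes moved:
constexpr double kIntensity  = 0.25;                         // FLOPs per byte
constexpr double kAttainable = kIntensity * kPeakBandwidth;  // 0.5 TFLOP/s, far below peak
```

At intensity 0.25 against a ridge point of 50, extra compute does nothing; only more reuse (tiling) moves the kernel to the right.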
3.10 When GPUs Are the Right Tool
GPUs excel when:
- Workloads are diverse
- Models change frequently
- Control flow is irregular
- Development velocity matters
This is why GPUs dominate:
- Research
- Prototyping
- Multi-tenant environments
Flexibility is the GPU’s superpower.
Chapter 3 Takeaway
If you remember one thing:
GPUs don’t make memory fast — they make waiting cheap.
Everything else follows from this.
For Readers Who Want to Go Deeper 🔍
🟢 Conceptual
- Kirk & Hwu — Programming Massively Parallel Processors
- NVIDIA CUDA Programming Guide
🟡 Architecture-Level
- NVIDIA GPU architecture whitepapers (Volta → Hopper)
- “Understanding GPU Microarchitecture” talks
🔴 Hardware / RTL-Level
- SM scheduler design papers (ISCA / MICRO)
- Register file & shared memory design (ISSCC)