Chapter 3
GPUs: Latency-Hiding Machines
The Promise of This Chapter
After this chapter, readers will understand:
- Why GPUs use thousands of threads
- Why SIMT exists
- Why GPUs look “messy” compared to TPUs
- When GPUs shine — and when they don’t
Most importantly, GPUs will stop feeling ad-hoc.
3.1 The Core GPU Assumption
GPUs are built around one brutal assumption:
Memory latency is unavoidable.
Accessing DRAM takes:
- Hundreds of cycles
- Orders of magnitude more time than a MAC
GPUs don’t try to eliminate this latency.
They try to hide it.
3.2 Latency Hiding vs Latency Avoidance
There are two ways to deal with slow memory:
Strategy 1: Avoid It (TPU-style)
- Keep data on chip
- Control dataflow tightly
- Require predictability
Strategy 2: Hide It (GPU-style)
- Switch to other work while waiting
- Assume irregular access
- Rely on massive parallelism
GPUs choose Strategy 2.
3.3 The GPU Mental Model
Think of a GPU as:
A machine that runs so many threads that some are always ready to compute.
When one thread stalls:
- Another runs
- And another
- And another
As long as some threads are runnable, compute units stay busy.
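To make this concrete, here is a minimal CUDA sketch of that oversubscription: a grid-stride kernel launched with far more threads than the GPU has execution units, so the schedulers always have runnable warps while others wait on memory. The kernel name, sizes, and launch configuration are illustrative, not taken from any particular codebase.

```
// Grid-stride loop: each thread handles many elements, and the launch
// deliberately creates far more threads than there are execution units.
__global__ void scale(float *x, float a, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        x[i] = a * x[i];   // while this load is in flight, the scheduler runs other warps
    }
}

// Host side (illustrative sizes): 1024 blocks x 256 threads = 262,144 threads in flight.
// scale<<<1024, 256>>>(d_x, 2.0f, n);
```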
3.4 SIMT: One Instruction, Many Threads
GPUs use SIMT (Single Instruction, Multiple Threads).
- Threads are grouped into warps (32 threads on NVIDIA hardware)
- All threads in a warp execute the same instruction at a time
- Each thread has its own registers and data
This gives:
- SIMD-like efficiency
- Thread-level flexibility
But it also introduces:
- Divergence penalties
- Control-flow sensitivity
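A small, hypothetical kernel makes the divergence penalty visible: when even and odd lanes of the same warp take different branches, the warp executes both paths one after the other, with half of its lanes masked off each time.

```
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) {            // even lanes take this path...
        x[i] = x[i] * 2.0f;
    } else {                     // ...odd lanes take this one, so the warp
        x[i] = x[i] + 1.0f;      // runs both branches back to back
    }
    // A branch-free form keeps the warp converged (the compiler can usually
    // turn it into predicated instructions):
    //   x[i] = (i % 2 == 0) ? x[i] * 2.0f : x[i] + 1.0f;
}
```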
3.5 Why GPUs Need So Many Threads
Let’s do the math.
If:
- Memory latency ≈ 400 cycles
- Each warp executes one instruction per cycle
Then to hide latency:
- You need hundreds of warps
- Ready to swap in instantly
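Spelled out as a back-of-the-envelope calculation (the latency and instruction counts below are assumed round numbers, not measurements of any specific GPU):

```
// Rough occupancy math for one warp scheduler:
//   latency ≈ 400 cycles per global-memory load
//   issue   = 1 instruction per warp per cycle
// If a warp issues ~4 independent instructions before stalling on a load,
// the scheduler needs roughly 400 / 4 = 100 resident warps to stay busy.
constexpr int kLatencyCycles  = 400;  // assumed DRAM round trip
constexpr int kInstrsPerStall = 4;    // assumed independent work per warp
constexpr int kWarpsToHideIt  = kLatencyCycles / kInstrsPerStall;  // 100 warps
```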
This is why GPUs:
- Have enormous register files
- Support tens of thousands of threads
- Look absurd compared to CPUs
This is not excess.
It’s necessary.
3.6 The Streaming Multiprocessor (SM)
Each GPU is made of many Streaming Multiprocessors (SMs).
An SM contains:
- ALUs
- Tensor cores
- Registers
- Shared memory
- Warp schedulers
Think of an SM as:
A latency-hiding engine with local scratchpad memory.
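You can ask the CUDA runtime what an SM on your own GPU actually holds. A minimal sketch using the standard cudaGetDeviceProperties call (device 0 assumed; the printed numbers vary by GPU):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);   // query device 0
    std::printf("SMs:                %d\n", p.multiProcessorCount);
    std::printf("Registers per SM:   %d\n", p.regsPerMultiprocessor);
    std::printf("Shared mem per SM:  %zu bytes\n", p.sharedMemPerMultiprocessor);
    std::printf("Max threads per SM: %d\n", p.maxThreadsPerMultiProcessor);
    return 0;
}
```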
3.7 Memory Hierarchy: Managed Chaos
GPUs expose multiple memory levels:
- Registers (private, fast)
- Shared memory (programmer-managed)
- L1 cache
- L2 cache
- Global memory (DRAM)
Why so many?
Because GPUs assume:
- Some reuse exists
- But not enough to fully control statically
So they:
- Cache opportunistically
- Let programmers manage locality when possible
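A classic example of “manage locality when possible” is staging a tile in shared memory. The sketch below is a hypothetical 1D averaging filter: each block loads its tile from DRAM once, then every thread reuses its neighbors’ values from shared memory instead of global memory (boundary handling is simplified to zeros).

```
#define TILE 256   // launch with blockDim.x == TILE, e.g. smooth<<<(n + TILE - 1) / TILE, TILE>>>(...)

__global__ void smooth(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2];             // tile plus a 1-element halo on each side
    int g = blockIdx.x * TILE + threadIdx.x;     // global index
    int l = threadIdx.x + 1;                     // local index (slot 0 is the left halo)

    if (g < n) tile[l] = in[g];                  // one global load per element
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;                 // left halo
    if (threadIdx.x == TILE - 1)
        tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;      // right halo
    __syncthreads();                             // make the tile visible to the whole block

    if (g < n)                                   // three reads hit shared memory, not DRAM
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}
```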
3.8 Tensor Cores: Compute Is Cheap
Tensor cores are a direct consequence of Chapter 0.
Because compute is cheap:
-
GPUs add specialized MAC units
-
Support FP16, BF16, INT8
-
Trade precision for throughput
Tensor cores:
- Do not fix memory problems
- Only help once data is fed efficiently
They amplify — not replace — good data movement.
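For the curious, the warp-level WMMA API is the most direct way to touch a tensor core from CUDA C++. In the sketch below, one warp multiplies a single 16×16×16 tile in FP16 with FP32 accumulation; it needs compute capability 7.0+ and a launch of at least one warp (e.g. `tile_mma<<<1, 32>>>(d_a, d_b, d_c)`). Pointer names and tile layout are illustrative.

```
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void tile_mma(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);            // 16x16 FP16 tile of A
    wmma::load_matrix_sync(b_frag, b, 16);            // 16x16 FP16 tile of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // D = A*B + C on tensor cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

Notice that the kernel says nothing about where `a` and `b` come from: keeping the tensor cores fed is still the memory system’s job.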
3.9 GPUs on the Roofline
On the Roofline graph:
- GPUs raise the compute roof
- Improve effective bandwidth via caching
- Push workloads to the right (higher arithmetic intensity) via tiling
But:
- Low-intensity workloads stay memory-bound
- Irregular access hurts
- Synchronization costs add up
GPUs reward good structure — but tolerate bad structure better than TPUs.
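To put a number on “memory-bound”, here is the ridge-point arithmetic with assumed round values rather than datasheet figures:

```
// Roofline: attainable FLOP/s = min(peak_compute, intensity * peak_bandwidth).
constexpr double kPeakFlops     = 100e12;  // assumed peak compute, FLOP/s
constexpr double kPeakBandwidth = 2e12;    // assumed DRAM bandwidth, bytes/s
constexpr double kRidgePoint    = kPeakFlops / kPeakBandwidth;  // 50 FLOPs per byte

// A streaming kernel doing 2 FLOPs per 8 bytes moved:
constexpr double kIntensity  = 0.25;                         // FLOPs per byte
constexpr double kAttainable = kIntensity * kPeakBandwidth;  // 0.5 TFLOP/s, far below peak
```

At intensity 0.25 against a ridge point of 50, extra compute does nothing; only more reuse (tiling) moves the kernel to the right.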
3.10 When GPUs Are the Right Tool
GPUs excel when:
- Workloads are diverse
- Models change frequently
- Control flow is irregular
- Development velocity matters
This is why GPUs dominate:
- Research
- Prototyping
- Multi-tenant environments
Flexibility is the GPU’s superpower.
Chapter 3 Takeaway
If you remember one thing:
GPUs don’t make memory fast — they make waiting cheap.
Everything else follows from this.
For Readers Who Want to Go Deeper 🔍
🟢 Conceptual
- Kirk & Hwu — Programming Massively Parallel Processors
- NVIDIA CUDA Programming Guide
🟡 Architecture-Level
- NVIDIA GPU architecture whitepapers (Volta → Hopper)
- “Understanding GPU Microarchitecture” talks
🔴 Hardware / RTL-Level
- SM scheduler design papers (ISCA / MICRO)
- Register file & shared memory design (ISSCC)