A hardware-first path to understanding GPUs, TPUs, and LLMs
How matrix multiplication, kernels, and accelerator design explain modern machine learning systems
d2l.ai is about what work you ask the hardware to do; hardware texts are about how the machine executes that work.
They meet in the middle — but they answer different questions.
Let me explain the connection clearly, and say when d2l.ai becomes useful if you care about hardware.
The clean mental separation
d2l.ai answers:
- What is an MLP, CNN, Transformer?
- What computations do they perform?
- What tensors exist, and how do they flow?
- What operations dominate training and inference?
Hardware resources answer:
- How are those computations executed?
- How is data moved and reused?
- What limits throughput?
- What architectural choices matter?
Think of d2l.ai as defining the workload, and hardware material as defining the machine.
The actual connection point: linear algebra
Almost everything in d2l.ai reduces to:
- matrix multiplication
- elementwise operations
- reductions (sum, softmax, norm)
That’s the bridge.
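Here is a minimal NumPy sketch of those three primitive categories; the shapes are arbitrary, chosen only for illustration:

```python
import numpy as np

# Illustrative shapes: a batch of 8 vectors, hidden size 16.
X = np.random.randn(8, 16)
W = np.random.randn(16, 16)

# 1. Matrix multiplication: the GEMM that dominates compute.
Y = X @ W

# 2. Elementwise operation: cheap math, but every element is read and written.
Y_relu = np.maximum(Y, 0.0)

# 3. Reduction: softmax combines an elementwise exp with a row-wise max and sum.
e = np.exp(Y - Y.max(axis=-1, keepdims=True))
softmax = e / e.sum(axis=-1, keepdims=True)
```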
An MLP in d2l.ai:
- looks like layers and equations
- is GEMMs + activation kernels on hardware
A Transformer block:
- looks like attention and feedforward layers
- is batched matmuls + memory movement + reductions
d2l.ai gives you semantic structure; hardware explains execution reality.
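For example, here is a rough inventory of the GEMMs inside one standard Transformer block. The dimension names (d, f, s) and the helper function are my own shorthand for illustration, not anything defined in d2l.ai:

```python
# Rough GEMM inventory for one Transformer block, ignoring norms and softmax.
# d = model width, f = FFN width (often 4*d), s = sequence length.
def transformer_block_gemms(d, f, s):
    macs = {
        "Q/K/V projections":  3 * (s * d * d),  # three (s,d) x (d,d) matmuls
        "attention QK^T":     s * s * d,        # (s,d) x (d,s)
        "attention (probs)V": s * s * d,        # (s,s) x (s,d)
        "output projection":  s * d * d,        # (s,d) x (d,d)
        "FFN up":             s * d * f,        # (s,d) x (d,f)
        "FFN down":            s * f * d,       # (s,f) x (f,d)
    }
    # These are multiply-accumulate counts; FLOPs are roughly 2x these numbers.
    return macs

print(transformer_block_gemms(d=4096, f=16384, s=2048))
```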
Why Chapter 7 (Patterson) feels “more GPU/TPU-like”
Chapter 7 teaches:
- throughput machines
- parallelism
- data movement
- accelerator design principles
That’s why it maps cleanly to GPUs and TPUs.
But it doesn’t tell you:
- what workloads actually look like
- why attention is painful for memory
- why autoregressive inference is sequential
d2l.ai fills that gap.
When d2l.ai becomes useful for a hardware-minded person
❌ Too early
If you read d2l.ai before understanding hardware:
- models feel abstract
- performance implications are invisible
- everything looks “just math”
✅ Useful after hardware basics
Once you understand:
- memory hierarchies
- matmul dominance
- bandwidth vs compute limits
d2l.ai suddenly becomes:
- a workload specification
- a way to reason about why hardware behaves the way it does
You start asking:
- “How many matmuls does this layer induce?”
- “Is this memory- or compute-bound?”
- “Why does this scale poorly?”
That’s the real payoff.
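A back-of-the-envelope sketch of the second question (“Is this memory- or compute-bound?”), using placeholder peak-compute and bandwidth numbers rather than any specific chip:

```python
# Arithmetic intensity of a GEMM Y = X @ W with X:(b,k), W:(k,n), fp16 storage.
# peak_flops and peak_bw are placeholder numbers, not a real accelerator spec.
def gemm_bound(b, k, n, bytes_per_elem=2,
               peak_flops=300e12, peak_bw=1.5e12):
    flops = 2 * b * k * n                              # multiply + add per output MAC
    bytes_moved = bytes_per_elem * (b*k + k*n + b*n)   # read X and W once, write Y once
    intensity = flops / bytes_moved                    # FLOPs per byte of traffic
    ridge = peak_flops / peak_bw                       # machine balance point
    return ("compute-bound" if intensity > ridge else "memory-bound", intensity)

# Large batch: each weight is reused across many rows -> compute-bound.
print(gemm_bound(b=4096, k=4096, n=4096))
# Batch-1 decode step: the whole weight matrix is read for one row -> memory-bound.
print(gemm_bound(b=1, k=4096, n=4096))
```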
Concrete example: MLP
In d2l.ai:
An MLP layer computes
Y = XW + b
In hardware terms:
- XW → GEMM kernel
- + b → elementwise kernel
- activation → elementwise kernel
- memory reads dominate unless fused
Without hardware knowledge, you stop at the equation.
With hardware knowledge, you see:
- kernel launches
- memory traffic
- fusion opportunities
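A rough byte-counting sketch of what fusion saves, assuming fp16 storage and arbitrary layer sizes:

```python
# Memory traffic for Y = relu(X @ W + b), with X:(m,k), W:(k,n), 2 bytes per element.
def mlp_traffic(m, k, n, bytes_per_elem=2):
    gemm = bytes_per_elem * (m*k + k*n + m*n)   # read X, W; write XW
    bias = bytes_per_elem * (m*n + n + m*n)     # read XW and b; write XW + b
    act  = bytes_per_elem * (m*n + m*n)         # read; apply relu; write
    unfused = gemm + bias + act                 # three kernels, output bounced through memory twice
    fused   = gemm + bytes_per_elem * n         # bias + relu applied in the GEMM epilogue
    return unfused, fused

unfused, fused = mlp_traffic(m=1024, k=4096, n=4096)
print(f"unfused: {unfused/1e6:.0f} MB, fused epilogue: {fused/1e6:.0f} MB")
```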
Concrete example: Attention
In d2l.ai:
Attention = softmax(QKᵀ / √d) V
In hardware terms:
- QKᵀ → large matmul
- softmax → reduction + exp
- memory-bound unless tiled
- inference is sequential: autoregressive decoding produces one token at a time
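A plain NumPy sketch of that decomposition, single head, toy sizes, no tiling, so the full s×s score matrix is materialized:

```python
import numpy as np

# Single-head attention on toy sizes: s tokens, head width d.
s, d = 128, 64
Q = np.random.randn(s, d)
K = np.random.randn(s, d)
V = np.random.randn(s, d)

scores = Q @ K.T / np.sqrt(d)                   # large matmul: (s,d) x (d,s) -> (s,s)
scores -= scores.max(axis=-1, keepdims=True)    # reduction (row max) for numerical stability
probs = np.exp(scores)                          # elementwise exp
probs /= probs.sum(axis=-1, keepdims=True)      # reduction (row sum) -> softmax
out = probs @ V                                 # second matmul: (s,s) x (s,d) -> (s,d)

# The (s,s) scores/probs arrays grow quadratically with sequence length;
# untiled, they are written out and re-read between these steps, which is
# exactly the traffic that tiled attention kernels avoid.
```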
This is where hardware intuition transforms d2l.ai from a textbook into a systems guide.
So how should you position d2l.ai?
As an extra reference.
A clean description you can use:
Dive into Deep Learning (d2l.ai) is useful as a workload-level companion once hardware fundamentals are clear. It describes the structure of models and training loops, which can then be interpreted through a hardware and systems lens.
One-sentence takeaway
d2l.ai tells you what the model does; hardware tells you how expensive that is and why.
Once you have both, ML systems stop being mysterious.