This is the battle-tested way successful primers (like TPU Deep Dive, Scaling Laws write-ups, and the NVIDIA CUDA blogs) are organized, and why that structure works.
1️⃣ Start From the Workload, Not the Hardware (Most Important)
Best primers do NOT start with GPUs or TPUs.
They start with “what problem are we trying to run fast?”
Chapter 0: Why Deep Learning Is Different
- Why matrix multiply dominates
- Why data movement > compute
- Why CPUs fail at scale
This matches:
- TPU Deep Dive
- How to Scale Your Model
- NVIDIA CUDA blogs
📌 Mental anchor:
“All accelerators exist to make GEMM cheap.”
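A quick back-of-the-envelope sketch makes the "data movement > compute" point concrete (the sizes below are arbitrary, chosen only for illustration): a GEMM does hundreds of FLOPs per byte moved, while an elementwise op does a fraction of one.

```python
import numpy as np

def gemm_intensity(m, k, n, bytes_per_el=4):
    """FLOPs per byte of off-chip traffic for C = A @ B, assuming each matrix crosses the bus once."""
    flops = 2 * m * k * n                               # one multiply + one add per MAC
    bytes_moved = (m * k + k * n + m * n) * bytes_per_el
    return flops / bytes_moved

def elementwise_intensity(n, bytes_per_el=4):
    """FLOPs per byte for y = a + b over n elements."""
    flops = n
    bytes_moved = 3 * n * bytes_per_el                  # read a, read b, write y
    return flops / bytes_moved

print(f"GEMM (4096^3):   {gemm_intensity(4096, 4096, 4096):7.1f} FLOPs/byte")
print(f"Add  (16M elems): {elementwise_intensity(1 << 24):7.3f} FLOPs/byte")
```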
2️⃣ Introduce the Roofline Model Early
This is the universal unifier.
Chapter 1: The Roofline Model (Plain English)
- FLOPs vs memory bandwidth
- Why faster ALUs don’t help
- Why SRAM is gold
Once readers understand this:
- GPUs make sense
- TPUs make sense
- HBM suddenly matters
📌 Almost every good hardware DL talk implicitly assumes roofline thinking.
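The whole model fits in one line of code: attainable throughput is the minimum of the compute ceiling and the bandwidth ceiling. A minimal sketch below; the peak-FLOPs and bandwidth numbers are placeholders, not any specific chip's spec.

```python
def roofline(peak_flops, mem_bw, arithmetic_intensity):
    """Attainable FLOP/s = min(compute ceiling, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bw * arithmetic_intensity)

PEAK = 100e12   # 100 TFLOP/s  (placeholder accelerator)
BW   = 1e12     # 1 TB/s HBM   (placeholder)

for name, ai in [("elementwise add", 0.08), ("small-batch matmul", 10), ("large tiled GEMM", 300)]:
    attainable = roofline(PEAK, BW, ai)
    bound = "memory-bound" if attainable < PEAK else "compute-bound"
    print(f"{name:20s} AI={ai:6.2f} -> {attainable / 1e12:6.1f} TFLOP/s ({bound})")
```

The "ridge point" here is PEAK / BW = 100 FLOPs/byte: below it, faster ALUs do nothing; only more reuse or more bandwidth helps.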
3️⃣ One Canonical Operation: Matrix Multiply
Do NOT explain CNNs, RNNs, Transformers separately at first.
Chapter 2: Everything Is MatMul
- Convolution → GEMM
- Attention → GEMM
- MLP → GEMM
Explain:
- Tiling
- Data reuse
- Blocking
📌 This is how Google’s TPU paper is written.
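A toy blocked matmul in NumPy makes tiling and reuse tangible. The tile size here is arbitrary; a real kernel would pick it to fit a specific cache or SRAM.

```python
import numpy as np

def blocked_matmul(A, B, tile=64):
    """Tiled GEMM: each (tile x tile) block of A and B is reused across a whole
    output tile, so off-chip traffic per FLOP shrinks as the tile grows."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):          # accumulate along the K dimension
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)
```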
4️⃣ Then Introduce Hardware as Answers to the Same Problem
Now you can show GPUs and TPUs as different solutions to the same constraints.
Chapter 3: GPU — Latency-Hiding Machines
- Thousands of threads
- SIMT
- Caches + shared memory
- Tensor Cores
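Why thousands of threads? A rough Little's-law sketch (all numbers are illustrative, not a real GPU's spec): to keep the memory system busy, you need latency × bandwidth worth of requests in flight, and that takes thousands of concurrent accesses.

```python
def concurrency_needed(mem_latency_ns, mem_bw_bytes_per_s, bytes_per_access=128):
    """Little's law: bytes in flight = latency * bandwidth.
    Divide by the size of one access to get outstanding requests needed."""
    in_flight_bytes = mem_latency_ns * 1e-9 * mem_bw_bytes_per_s
    return in_flight_bytes / bytes_per_access

# Placeholder numbers: ~500 ns DRAM latency, ~1 TB/s of bandwidth to saturate.
requests = concurrency_needed(500, 1e12)
print(f"~{requests:.0f} outstanding 128-byte accesses needed to hide memory latency")
```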
Chapter 4: TPU — Dataflow Machines
- Systolic arrays
- Deterministic data movement
- Large on-chip SRAM
- Compiler-controlled scheduling
🧠 Key insight:
GPU = “hide memory latency”
TPU = “avoid memory latency”
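To make "deterministic data movement" concrete, here is a toy cycle-by-cycle model of an output-stationary systolic array. This is a sketch for intuition only, not how any particular TPU generation is actually wired.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary systolic array: PE(i, j) holds C[i, j].
    A streams in from the left and B from the top, each skewed by one cycle
    per row/column, so the matching operands meet at the right PE on schedule."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for t in range(M + N + K - 2):             # cycles for the skewed wavefront to pass
        for i in range(M):
            for j in range(N):
                k = t - i - j                  # which operand pair reaches PE(i, j) now
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Note what is absent: no caches, no scheduling decisions at runtime. Every operand's position at every cycle is known ahead of time, which is exactly what "compiler-controlled scheduling" buys you.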
5️⃣ Put Scaling After Single-Chip Understanding
Most people get scaling wrong because they start here.
Chapter 5: Scaling One Chip
- Batch size
- Arithmetic intensity
- Model parallelism vs data parallelism
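Batch size and arithmetic intensity connect directly. A quick sketch for one dense layer (sizes are illustrative): FLOPs grow with the batch, but the weight traffic does not, so small batches are bandwidth-bound and large batches become compute-bound.

```python
def linear_layer_intensity(batch, d_in, d_out, bytes_per_el=2):
    """FLOPs per byte for Y = X @ W, with X:(batch, d_in) and W:(d_in, d_out)."""
    flops = 2 * batch * d_in * d_out
    bytes_moved = (batch * d_in + d_in * d_out + batch * d_out) * bytes_per_el
    return flops / bytes_moved

for b in (1, 8, 64, 512):
    print(f"batch={b:4d}: {linear_layer_intensity(b, 4096, 4096):7.1f} FLOPs/byte")
```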
Chapter 6: Scaling Many Chips
- AllReduce
- Interconnect bandwidth
- Pipeline parallelism
This is where “How to Scale Your Model” fits naturally.
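A simple cost model for data-parallel gradient AllReduce (ring algorithm; the parameter count and link bandwidth are placeholders) shows why interconnect bandwidth, not FLOPs, often sets the scaling limit.

```python
def ring_allreduce_seconds(param_count, num_devices, link_bw_bytes_per_s, bytes_per_el=4):
    """Ring AllReduce sends ~2*(D-1)/D of the gradient bytes over each device's link."""
    grad_bytes = param_count * bytes_per_el
    traffic_per_device = 2 * (num_devices - 1) / num_devices * grad_bytes
    return traffic_per_device / link_bw_bytes_per_s

# Placeholder: 7B parameters as FP32 gradients, 64 devices, 100 GB/s links.
t = ring_allreduce_seconds(7e9, 64, 100e9)
print(f"~{t:.2f} s of communication per AllReduce step")
```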
6️⃣ Precision & Sparsity as Optimization Knobs
Only now introduce:
- FP32 → FP16 → BF16 → INT8
- Sparsity
- Quantization
📌 Explain them as:
“Ways to increase arithmetic intensity or reduce bandwidth pressure.”
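The "optimization knob" framing drops straight out of the same intensity arithmetic used earlier: fewer bytes per element means more FLOPs per byte moved (sizes are illustrative).

```python
def gemm_intensity(m, k, n, bytes_per_el):
    """FLOPs per byte for C = A @ B at a given element width."""
    return (2 * m * k * n) / ((m * k + k * n + m * n) * bytes_per_el)

for name, width in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    print(f"{name:10s}: {gemm_intensity(4096, 4096, 4096, width):7.1f} FLOPs/byte")
```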
7️⃣ One Unifying Visual Per Chapter
Great primers always have:
- One killer diagram
- One mental model
- One takeaway sentence
Example:
- GPU: “Massively threaded latency hider”
- TPU: “Clocked matrix factory”
- HBM: “Compute’s oxygen supply”
8️⃣ Where to Put This Primer (Practically)
Best Formats (Ranked)
- Living Web Document (like TPU Deep Dive)
  - Easy diagrams
  - Linkable
  - Evolves with hardware
- Blog Series
  - Each chapter = one post
  - Easy to share
- Open PDF / GitHub Pages
  - Versioned
  - Community contributions
📌 Avoid:
- Pure academic papers (too dense)
- Video-only (hard to reference)
- Slides without narrative
9️⃣ Reference Exemplars (Study These)
If you want to match the gold standard, study:
- TPU Deep Dive (Google)
- NVIDIA CUDA Blog
- Scaling Laws & “How to Scale Your Model”
- Eyeriss paper
- Vivienne Sze’s MIT lectures
🔑 Final Recommendation (If You’re Writing This)
Structure it like this:
- Workload → Constraints
- Roofline
- MatMul
- GPU as a solution
- TPU as a solution
- Scaling
- Precision & sparsity
- What comes next