Monday, December 22, 2025

Before the beginning

Here is the battle-tested way successful primers (like TPU Deep Dive, Scaling Laws, the CUDA blogs) are organized, and why that structure works.


1️⃣ Start From the Workload, Not the Hardware (Most Important)

The best primers do NOT start with GPUs or TPUs.
They start with “what problem are we trying to run fast?”

Chapter 0: Why Deep Learning Is Different

  • Why matrix multiply dominates

  • Why data movement costs more than compute

  • Why CPUs fail at scale

This matches:

  • TPU Deep Dive

  • How to Scale Your Model

  • NVIDIA CUDA blogs

📌 Mental anchor:

“All accelerators exist to make GEMM cheap.”
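
To make that anchor concrete, here is a quick back-of-the-envelope sketch in plain Python (bf16 bytes assumed; the sizes are illustrative, not tied to any chip): FLOPs in a square matmul grow cubically with size while the data touched grows only quadratically, which is why GEMM rewards dedicated hardware and why the data movement is the real constraint.

```python
# Rough FLOP vs. byte count for a square matmul C = A @ B (illustrative only).
def gemm_stats(n, bytes_per_elem=2):          # bf16 assumed
    flops = 2 * n**3                          # n^3 multiply-adds = 2n^3 FLOPs
    data  = 3 * n**2 * bytes_per_elem         # read A, read B, write C (no reuse counted)
    return flops, data, flops / data          # FLOPs per byte moved

for n in (256, 1024, 4096):
    flops, data, ratio = gemm_stats(n)
    print(f"n={n:5d}: {flops:.2e} FLOPs, {data/1e6:8.1f} MB touched, "
          f"{ratio:7.1f} FLOPs/byte")
```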


2️⃣ Introduce the Roofline Model Early

This is the universal unifier.

Chapter 1: The Roofline Model (Plain English)

  • FLOPs vs memory bandwidth

  • Why faster ALUs don’t help a memory-bound kernel

  • Why SRAM is gold

Once readers understand this:

  • GPUs make sense

  • TPUs make sense

  • HBM suddenly matters

📌 Almost every good hardware DL talk implicitly assumes roofline thinking.
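
A minimal roofline check in plain Python makes the model concrete. The peak-FLOP and bandwidth numbers below are made-up placeholders, not any particular accelerator: attainable throughput is simply the minimum of peak compute and bandwidth times arithmetic intensity.

```python
# Toy roofline: attainable = min(peak_flops, bandwidth * arithmetic_intensity).
# Hardware numbers are assumed placeholders, not a real accelerator spec.
PEAK_FLOPS = 100e12      # 100 TFLOP/s  (assumed)
BANDWIDTH  = 1e12        # 1 TB/s       (assumed)
RIDGE      = PEAK_FLOPS / BANDWIDTH   # FLOPs/byte needed to become compute-bound

def attainable(ai):                    # ai = arithmetic intensity (FLOPs/byte)
    return min(PEAK_FLOPS, BANDWIDTH * ai)

for name, ai in [("elementwise add", 0.25), ("small GEMM", 20), ("big GEMM", 300)]:
    bound = "compute-bound" if ai >= RIDGE else "memory-bound"
    print(f"{name:16s} AI={ai:6.2f} -> {attainable(ai)/1e12:6.1f} TFLOP/s ({bound})")
```

The ridge point (peak FLOPs divided by bandwidth) is the arithmetic intensity a kernel needs before faster ALUs start to matter, which is exactly the "faster ALUs don’t help" bullet above.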


3️⃣ One Canonical Operation: Matrix Multiply

Do NOT explain CNNs, RNNs, and Transformers separately at first.

Chapter 2: Everything Is MatMul

  • Convolution → GEMM

  • Attention → GEMM

  • MLP → GEMM

Explain:

  • Tiling

  • Data reuse

  • Blocking

📌 This is how Google’s TPU paper is written.
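
Here is a minimal NumPy sketch of the tiling/blocking idea above (the tile size is arbitrary; a real kernel would pick it to fit the fast on-chip memory): each tile of A and B brought into fast memory is reused across a whole block of C, which is the data-reuse argument in the list.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked GEMM: compute C in (tile x tile) output blocks so each tile
    of A and B brought into fast memory is reused many times."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # The inner kernel: a small GEMM sized to fit on-chip.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)
```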


4️⃣ Then Introduce Hardware as Answers to the Same Problem

Now you can show GPUs and TPUs as different solutions to the same constraints.

Chapter 3: GPU — Latency-Hiding Machines

  • Thousands of threads

  • SIMT

  • Caches + shared memory

  • Tensor Cores

Chapter 4: TPU — Dataflow Machines

  • Systolic arrays

  • Deterministic data movement

  • Large on-chip SRAM

  • Compiler-controlled scheduling

🧠 Key insight:

GPU = “hide memory latency”
TPU = “avoid memory latency”
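
To make “dataflow machine” concrete, here is a toy, cycle-by-cycle simulation (pure NumPy, illustrative only, not how any real TPU is programmed) of a weight-stationary systolic array: weights sit still in the grid, activations stream in from the left, partial sums march down and exit the bottom. Every movement is fixed in advance, which is what lets a compiler schedule it statically instead of hiding latency at runtime.

```python
import numpy as np

def systolic_matmul(X, W):
    """Toy weight-stationary systolic array computing Y = X @ W.
    PE (k, n) permanently holds W[k, n]; activations enter skewed from
    the left and flow right; partial sums flow down one PE per cycle."""
    M, K = X.shape
    K2, N = W.shape
    assert K == K2
    act  = np.zeros((K, N))   # activation register inside each PE
    psum = np.zeros((K, N))   # partial-sum register inside each PE
    Y    = np.zeros((M, N))
    for t in range(M + N + K):
        # 1. Drain: values about to leave the bottom row are finished outputs.
        for n in range(N):
            m = t - n - K                 # which output row exits column n now
            if 0 <= m < M:
                Y[m, n] = psum[K - 1, n]
        # 2. Shift: partial sums move down one PE, activations move right one PE.
        psum = np.vstack([np.zeros((1, N)), psum[:-1]])
        act  = np.hstack([np.zeros((K, 1)), act[:, :-1]])
        # 3. Inject: feed X at the left edge, skewed by one cycle per row.
        for k in range(K):
            m = t - k
            act[k, 0] = X[m, k] if 0 <= m < M else 0.0
        # 4. Compute: every PE does one multiply-accumulate with its fixed weight.
        psum += act * W
    return Y

X, W = np.random.rand(5, 4), np.random.rand(4, 3)
assert np.allclose(systolic_matmul(X, W), X @ W)
```

Contrast with the GPU model above: nothing here ever waits on a cache miss because nothing ever misses; the trade-off is that the shape and schedule of the computation must be known ahead of time.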


5️⃣ Put Scaling After Single-Chip Understanding

Most people get scaling wrong because they start here.

Chapter 5: Scaling One Chip

  • Batch size

  • Arithmetic intensity

  • Model parallelism vs data parallelism
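
A quick sketch of the batch-size knob (plain Python; bf16 bytes and layer dimensions are assumed for illustration): a dense layer’s weights are read once per step regardless of batch size, so growing the batch raises arithmetic intensity and pushes the layer toward the compute-bound side of the roofline.

```python
# Arithmetic intensity of one dense layer (out = x @ W) as batch size grows.
# Assumes bf16 (2 bytes/element) and that weights are re-read every step,
# which is the memory-bound regime small batches live in.
def layer_ai(batch, d_in, d_out, bytes_per_elem=2):
    flops = 2 * batch * d_in * d_out
    bytes_moved = bytes_per_elem * (batch * d_in      # read activations
                                    + d_in * d_out    # read weights
                                    + batch * d_out)  # write outputs
    return flops / bytes_moved

for b in (1, 8, 64, 512):
    print(f"batch={b:4d}: AI = {layer_ai(b, 4096, 4096):6.1f} FLOPs/byte")
```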

Chapter 6: Scaling Many Chips

  • AllReduce

  • Interconnect bandwidth

  • Pipeline parallelism

This is where “How to Scale Your Model” fits naturally.
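
A back-of-the-envelope sketch for the multi-chip side (plain Python): a standard ring AllReduce moves roughly 2·(N−1)/N of the gradient buffer over each link, so the bandwidth term alone already tells you when the interconnect, not compute, sets the scaling limit. The gradient size and link bandwidth below are assumed placeholders, not a real system.

```python
# Back-of-the-envelope ring AllReduce time (bandwidth term only, latency ignored).
# Link bandwidth and gradient size are illustrative placeholders.
def ring_allreduce_seconds(bytes_per_device, n_devices, link_bw_bytes_per_s):
    # Ring AllReduce moves 2*(N-1)/N of the buffer over the slowest link.
    return 2 * (n_devices - 1) / n_devices * bytes_per_device / link_bw_bytes_per_s

grad_bytes = 7e9 * 2            # e.g. 7B parameters in bf16 (assumed)
for n in (8, 64, 512):
    t = ring_allreduce_seconds(grad_bytes, n, link_bw_bytes_per_s=100e9)  # 100 GB/s assumed
    print(f"{n:4d} devices: ~{t*1e3:7.1f} ms per full-gradient AllReduce")
```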


6️⃣ Precision & Sparsity as Optimization Knobs

Only now introduce:

  • FP32 → FP16 → BF16 → INT8

  • Sparsity

  • Quantization

📌 Explain them as:

“Ways to increase arithmetic intensity or reduce bandwidth pressure.”
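
A minimal sketch of the bandwidth angle on quantization (NumPy, symmetric per-tensor INT8, which is the simplest scheme rather than the only one): the same tensor takes a quarter of the FP32 bytes, at the cost of a bounded rounding error.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: x ≈ scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(x)
err = np.abs(dequantize(q, scale) - x).max()
print(f"bytes: fp32={x.nbytes/1e6:.1f} MB -> int8={q.nbytes/1e6:.1f} MB, "
      f"max abs error={err:.4f}")
```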


7️⃣ One Unifying Visual Per Chapter

Great primers always have:

  • One killer diagram

  • One mental model

  • One takeaway sentence

Example:

  • GPU: “Massively threaded latency hider”

  • TPU: “Clocked matrix factory”

  • HBM: “Compute’s oxygen supply”


8️⃣ Where to Put This Primer (Practically)

Best Formats (Ranked)

  1. Living Web Document (like TPU Deep Dive)

    • Easy diagrams

    • Linkable

    • Evolves with hardware

  2. Blog Series

    • Each chapter = one post

    • Easy to share

  3. Open PDF / GitHub Pages

    • Versioned

    • Community contributions

📌 Avoid:

  • Pure academic papers (too dense)

  • Video-only (hard to reference)

  • Slides without narrative


9️⃣ Reference Exemplars (Study These)

If you want to match the gold standard, study:

  • TPU Deep Dive (Google)

  • NVIDIA CUDA Blog

  • Scaling Laws & “How to Scale Your Model”

  • Eyeriss paper

  • Vivienne Sze’s MIT lectures


🔑 Final Recommendation (If You’re Writing This)

Structure it like this:

  1. Workload → Constraints

  2. Roofline

  3. MatMul

  4. GPU as a solution

  5. TPU as a solution

  6. Scaling

  7. Precision & sparsity

  8. What comes next
