Monday, December 22, 2025

Before the beginning

Here is the battle-tested way successful primers (like TPU Deep Dive, Scaling Laws, the CUDA blogs) are organized, and why that structure works.


1️⃣ Start From the Workload, Not the Hardware (Most Important)

The best primers do NOT start with GPUs or TPUs.
They start with “what problem are we trying to run fast?”

Chapter 0: Why Deep Learning Is Different

  • Why matrix multiply dominates

  • Why data movement costs more than compute

  • Why CPUs fail at scale

This matches:

  • TPU Deep Dive

  • How to Scale Your Model

  • NVIDIA CUDA blogs

📌 Mental anchor:

“All accelerators exist to make GEMM cheap.”
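
To make that anchor concrete, here is a quick back-of-the-envelope sketch in plain Python (bf16 bytes assumed; the sizes are illustrative, not tied to any chip): FLOPs in a square matmul grow cubically with size while the data touched grows only quadratically, which is why GEMM rewards dedicated hardware and why the data movement is the real constraint.

```python
# Rough FLOP vs. byte count for a square matmul C = A @ B (illustrative only).
def gemm_stats(n, bytes_per_elem=2):          # bf16 assumed
    flops = 2 * n**3                          # n^3 multiply-adds = 2n^3 FLOPs
    data  = 3 * n**2 * bytes_per_elem         # read A, read B, write C (no reuse counted)
    return flops, data, flops / data          # FLOPs per byte moved

for n in (256, 1024, 4096):
    flops, data, ratio = gemm_stats(n)
    print(f"n={n:5d}: {flops:.2e} FLOPs, {data/1e6:8.1f} MB touched, "
          f"{ratio:7.1f} FLOPs/byte")
```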


2️⃣ Introduce the Roofline Model Early

This is the universal unifier.

Chapter 1: The Roofline Model (Plain English)

  • FLOPs vs memory bandwidth

  • Why faster ALUs don’t help a memory-bound kernel

  • Why SRAM is gold

Once readers understand this:

  • GPUs make sense

  • TPUs make sense

  • HBM suddenly matters

📌 Almost every good hardware DL talk implicitly assumes roofline thinking.
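
A minimal roofline check in plain Python makes the model concrete. The peak-FLOP and bandwidth numbers below are made-up placeholders, not any particular accelerator: attainable throughput is simply the minimum of peak compute and bandwidth times arithmetic intensity.

```python
# Toy roofline: attainable = min(peak_flops, bandwidth * arithmetic_intensity).
# Hardware numbers are assumed placeholders, not a real accelerator spec.
PEAK_FLOPS = 100e12      # 100 TFLOP/s  (assumed)
BANDWIDTH  = 1e12        # 1 TB/s       (assumed)
RIDGE      = PEAK_FLOPS / BANDWIDTH   # FLOPs/byte needed to become compute-bound

def attainable(ai):                    # ai = arithmetic intensity (FLOPs/byte)
    return min(PEAK_FLOPS, BANDWIDTH * ai)

for name, ai in [("elementwise add", 0.25), ("small GEMM", 20), ("big GEMM", 300)]:
    bound = "compute-bound" if ai >= RIDGE else "memory-bound"
    print(f"{name:16s} AI={ai:6.2f} -> {attainable(ai)/1e12:6.1f} TFLOP/s ({bound})")
```

The ridge point (peak FLOPs divided by bandwidth) is the arithmetic intensity a kernel needs before faster ALUs start to matter, which is exactly the "faster ALUs don’t help" bullet above.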


3️⃣ One Canonical Operation: Matrix Multiply

Do NOT explain CNNs, RNNs, and Transformers separately at first.

Chapter 2: Everything Is MatMul

  • Convolution → GEMM

  • Attention → GEMM

  • MLP → GEMM

Explain:

  • Tiling

  • Data reuse

  • Blocking

📌 This is how Google’s TPU paper is written.
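
Here is a minimal NumPy sketch of the tiling/blocking idea above (the tile size is arbitrary; a real kernel would pick it to fit the fast on-chip memory): each tile of A and B brought into fast memory is reused across a whole block of C, which is the data-reuse argument in the list.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked GEMM: compute C in (tile x tile) output blocks so each tile
    of A and B brought into fast memory is reused many times."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # The inner kernel: a small GEMM sized to fit on-chip.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)
```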


4️⃣ Then Introduce Hardware as Answers to the Same Problem

Now you can show GPUs and TPUs as different solutions to the same constraints.

Chapter 3: GPU — Latency-Hiding Machines

  • Thousands of threads

  • SIMT

  • Caches + shared memory

  • Tensor Cores

Chapter 4: TPU — Dataflow Machines

  • Systolic arrays

  • Deterministic data movement

  • Large on-chip SRAM

  • Compiler-controlled scheduling

🧠 Key insight:

GPU = “hide memory latency”
TPU = “avoid memory latency”
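
To make “dataflow machine” concrete, here is a toy, cycle-by-cycle simulation (pure NumPy, illustrative only, not how any real TPU is programmed) of a weight-stationary systolic array: weights sit still in the grid, activations stream in from the left, partial sums march down and exit the bottom. Every movement is fixed in advance, which is what lets a compiler schedule it statically instead of hiding latency at runtime.

```python
import numpy as np

def systolic_matmul(X, W):
    """Toy weight-stationary systolic array computing Y = X @ W.
    PE (k, n) permanently holds W[k, n]; activations enter skewed from
    the left and flow right; partial sums flow down one PE per cycle."""
    M, K = X.shape
    K2, N = W.shape
    assert K == K2
    act  = np.zeros((K, N))   # activation register inside each PE
    psum = np.zeros((K, N))   # partial-sum register inside each PE
    Y    = np.zeros((M, N))
    for t in range(M + N + K):
        # 1. Drain: values about to leave the bottom row are finished outputs.
        for n in range(N):
            m = t - n - K                 # which output row exits column n now
            if 0 <= m < M:
                Y[m, n] = psum[K - 1, n]
        # 2. Shift: partial sums move down one PE, activations move right one PE.
        psum = np.vstack([np.zeros((1, N)), psum[:-1]])
        act  = np.hstack([np.zeros((K, 1)), act[:, :-1]])
        # 3. Inject: feed X at the left edge, skewed by one cycle per row.
        for k in range(K):
            m = t - k
            act[k, 0] = X[m, k] if 0 <= m < M else 0.0
        # 4. Compute: every PE does one multiply-accumulate with its fixed weight.
        psum += act * W
    return Y

X, W = np.random.rand(5, 4), np.random.rand(4, 3)
assert np.allclose(systolic_matmul(X, W), X @ W)
```

Contrast with the GPU model above: nothing here ever waits on a cache miss because nothing ever misses; the trade-off is that the shape and schedule of the computation must be known ahead of time.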


5️⃣ Put Scaling After Single-Chip Understanding

Most people get scaling wrong because they start here.

Chapter 5: Scaling One Chip

  • Batch size

  • Arithmetic intensity

  • Model parallelism vs data parallelism
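
A quick sketch of the batch-size knob (plain Python; bf16 bytes and layer dimensions are assumed for illustration): a dense layer’s weights are read once per step regardless of batch size, so growing the batch raises arithmetic intensity and pushes the layer toward the compute-bound side of the roofline.

```python
# Arithmetic intensity of one dense layer (out = x @ W) as batch size grows.
# Assumes bf16 (2 bytes/element) and that weights are re-read every step,
# which is the memory-bound regime small batches live in.
def layer_ai(batch, d_in, d_out, bytes_per_elem=2):
    flops = 2 * batch * d_in * d_out
    bytes_moved = bytes_per_elem * (batch * d_in      # read activations
                                    + d_in * d_out    # read weights
                                    + batch * d_out)  # write outputs
    return flops / bytes_moved

for b in (1, 8, 64, 512):
    print(f"batch={b:4d}: AI = {layer_ai(b, 4096, 4096):6.1f} FLOPs/byte")
```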

Chapter 6: Scaling Many Chips

  • AllReduce

  • Interconnect bandwidth

  • Pipeline parallelism

This is where “How to Scale Your Model” fits naturally.
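
A back-of-the-envelope sketch for the multi-chip side (plain Python): a standard ring AllReduce moves roughly 2·(N−1)/N of the gradient buffer over each link, so the bandwidth term alone already tells you when the interconnect, not compute, sets the scaling limit. The gradient size and link bandwidth below are assumed placeholders, not a real system.

```python
# Back-of-the-envelope ring AllReduce time (bandwidth term only, latency ignored).
# Link bandwidth and gradient size are illustrative placeholders.
def ring_allreduce_seconds(bytes_per_device, n_devices, link_bw_bytes_per_s):
    # Ring AllReduce moves 2*(N-1)/N of the buffer over the slowest link.
    return 2 * (n_devices - 1) / n_devices * bytes_per_device / link_bw_bytes_per_s

grad_bytes = 7e9 * 2            # e.g. 7B parameters in bf16 (assumed)
for n in (8, 64, 512):
    t = ring_allreduce_seconds(grad_bytes, n, link_bw_bytes_per_s=100e9)  # 100 GB/s assumed
    print(f"{n:4d} devices: ~{t*1e3:7.1f} ms per full-gradient AllReduce")
```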


6️⃣ Precision & Sparsity as Optimization Knobs

Only now introduce:

  • FP32 → FP16 → BF16 → INT8

  • Sparsity

  • Quantization

📌 Explain them as:

“Ways to increase arithmetic intensity or reduce bandwidth pressure.”
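
A minimal sketch of the bandwidth angle on quantization (NumPy, symmetric per-tensor INT8, which is the simplest scheme rather than the only one): the same tensor takes a quarter of the FP32 bytes, at the cost of a bounded rounding error.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: x ≈ scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(x)
err = np.abs(dequantize(q, scale) - x).max()
print(f"bytes: fp32={x.nbytes/1e6:.1f} MB -> int8={q.nbytes/1e6:.1f} MB, "
      f"max abs error={err:.4f}")
```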


7️⃣ One Unifying Visual Per Chapter

Great primers always have:

  • One killer diagram

  • One mental model

  • One takeaway sentence

Example:

  • GPU: “Massively threaded latency hider”

  • TPU: “Clocked matrix factory”

  • HBM: “Compute’s oxygen supply”


8️⃣ Where to Put This Primer (Practically)

Best Formats (Ranked)

  1. Living Web Document (like TPU Deep Dive)

    • Easy diagrams

    • Linkable

    • Evolves with hardware

  2. Blog Series

    • Each chapter = one post

    • Easy to share

  3. Open PDF / GitHub Pages

    • Versioned

    • Community contributions

📌 Avoid:

  • Pure academic papers (too dense)

  • Video-only (hard to reference)

  • Slides without narrative


9️⃣ Reference Exemplars (Study These)

If you want to match the gold standard, study:

  • TPU Deep Dive (Google)

  • NVIDIA CUDA Blog

  • Scaling Laws & “How to Scale Your Model”

  • Eyeriss paper

  • Vivienne Sze’s MIT lectures


🔑 Final Recommendation (If You’re Writing This)

Structure it like this:

  1. Workload → Constraints

  2. Roofline

  3. MatMul

  4. GPU as a solution

  5. TPU as a solution

  6. Scaling

  7. Precision & sparsity

  8. What comes next
