Chapter 6
Scaling Many Chips: When Communication Dominates
The Core Question
What happens when the model no longer fits on one chip?
Answer:
The bottleneck moves from memory to communication.
6.1 Why Scaling Breaks So Suddenly
Single-chip optimization assumes:
- Fast local memory
- High reuse
- Predictable access
Multi-chip systems introduce:
- Long wires
- Serialization
- Synchronization
Communication costs:
- Scale with model size
- Do not shrink as compute gets faster
This is why scaling efficiency collapses.
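A toy cost model (a minimal sketch; every constant below is invented, and overlap of compute with communication is ignored) shows the shape of the collapse: the compute term divides across chips, the communication term does not.

```python
# Toy step-time model. All numbers are hypothetical; the point is the
# shape: compute time shrinks with chip count, communication time does not.

def step_time(n_chips, model_flops, chip_flops, comm_bytes, link_bw):
    compute = model_flops / (n_chips * chip_flops)  # shrinks with n_chips
    comm = comm_bytes / link_bw                     # roughly constant
    return compute + comm

def efficiency(n_chips, **kw):
    ideal = step_time(1, **kw) / n_chips            # perfect linear scaling
    return ideal / step_time(n_chips, **kw)

params = dict(model_flops=1e15, chip_flops=1e14,    # invented magnitudes
              comm_bytes=4e10, link_bw=1e11)
for n in (1, 8, 64, 512):
    print(n, f"{efficiency(n, **params):.2f}")      # 1.00, 0.79, 0.29, 0.05
```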
6.2 The Three Parallelism Strategies (Precisely)
1. Data Parallelism
- Each chip holds the full model
- Each chip processes a different slice of the batch
- Gradients are synchronized every step
Cost: AllReduce on gradients
Failure mode: Communication dominates
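As a concrete sketch, here is a minimal data-parallel step in JAX (the linear model, sizes, and learning rate are placeholders, not a recipe): the `lax.pmean` call is the gradient AllReduce.

```python
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

def step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    # The AllReduce: every gradient is averaged across devices before
    # the update. This is the bandwidth-bound synchronization point.
    grads = jax.lax.pmean(grads, axis_name="batch")
    return w - 0.01 * grads

p_step = jax.pmap(step, axis_name="batch")

# One leading-axis entry per device. On CPU you can simulate devices with
# XLA_FLAGS=--xla_force_host_platform_device_count=8.
n = jax.local_device_count()
ws = jnp.zeros((n, 4, 1))   # replicated weights
xs = jnp.ones((n, 8, 4))    # per-device shard of the batch
ys = jnp.ones((n, 8, 1))
ws = p_step(ws, xs, ys)
```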
2. Model Parallelism
- The model is split across chips
- Activations are communicated between chips
Cost: Activation traffic
Failure mode: Latency and imbalance
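A minimal tensor-parallel sketch in JAX's GSPMD style (the sizes and the "model" axis name are arbitrary, and the column count is assumed divisible by the device count): the weight matrix is split column-wise, and the compiler inserts the activation collectives the sharded matmul needs.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One mesh axis, "model", spanning all available devices.
mesh = Mesh(np.array(jax.devices()), axis_names=("model",))

# Weights sharded column-wise across chips; input replicated everywhere.
w = jax.device_put(jnp.ones((512, 512)), NamedSharding(mesh, P(None, "model")))
x = jax.device_put(jnp.ones((8, 512)), NamedSharding(mesh, P()))

@jax.jit
def forward(x, w):
    # Each chip computes its slice of the output; communication appears
    # wherever a consumer needs a different layout.
    return jnp.tanh(x @ w)

y = forward(x, w)
print(y.sharding)   # output stays sharded over its last axis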
3. Pipeline Parallelism
- Layers are assigned to pipeline stages
- The batch is split into micro-batches
Cost: Bubbles and scheduling
Failure mode: Utilization loss
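For a GPipe-style schedule the bubble has a standard closed form: with p stages and m micro-batches, each stage idles for (p - 1) of the (m + p - 1) schedule slots, so amortizing bubbles requires many micro-batches per stage.

```python
def bubble_fraction(stages, microbatches):
    # Idle fraction of a GPipe-style pipeline schedule.
    return (stages - 1) / (microbatches + stages - 1)

for m in (1, 4, 16, 64):
    print(m, f"{bubble_fraction(8, m):.2f}")   # 0.88, 0.64, 0.30, 0.10
```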
No strategy avoids communication — they just move it.
6.3 Why AllReduce Is the Silent Killer
AllReduce:
- Touches every parameter
- Requires synchronization
- Is bandwidth-bound
Even with perfect overlap:
- Communication grows with model size
- Compute per chip grows more slowly
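The standard ring AllReduce cost model makes "bandwidth-bound" concrete: each chip sends and receives about 2 * (p - 1) / p * N bytes, so the time floor approaches 2N / link_bandwidth no matter how many chips participate. The model size and link speed below are illustrative, not measured.

```python
def ring_allreduce_time(n_bytes, n_chips, link_bw):
    # Bandwidth term of a ring AllReduce; latency and overlap ignored.
    return 2 * (n_chips - 1) / n_chips * n_bytes / link_bw

grad_bytes = 70e9 * 2   # e.g. 70B parameters in bf16 (illustrative)
link_bw = 50e9          # 50 GB/s per chip (illustrative)
for p in (8, 64, 512):
    print(p, f"{ring_allreduce_time(grad_bytes, p, link_bw):.1f} s")
    # about 4.9, 5.5, 5.6: adding chips barely changes the AllReduce time
```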
This is why:
- Faster interconnects matter more than faster ALUs
- Topology matters
- Network-aware scheduling exists
6.4 Scaling Efficiency (Why It Plateaus)
Two regimes:
- Strong scaling: fixed problem size → diminishing returns
- Weak scaling: fixed work per chip → better behavior
Most ML training wants strong scaling — and pays the price.
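A toy comparison in the same spirit as the sketch in 6.1 (the constants are again invented): strong scaling divides a fixed amount of compute across more chips while communication stays, weak scaling holds per-chip compute fixed.

```python
def strong_efficiency(n, compute=10.0, comm=0.5):
    # Fixed total work split n ways; communication does not shrink.
    speedup = (compute + comm) / (compute / n + comm)
    return speedup / n

def weak_efficiency(n, compute=10.0, comm=0.5):
    # Per-chip work held fixed; only the ring term of communication grows.
    return compute / (compute + comm * 2 * (n - 1) / n)

for n in (1, 8, 64, 512):
    print(n, f"strong {strong_efficiency(n):.2f}",
             f"weak {weak_efficiency(n):.2f}")
# strong: 1.00, 0.75, 0.25, 0.04   weak: 1.00, 0.92, 0.91, 0.91
```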
6.5 Why TPUs Scale Differently
TPUs benefit from:
- Synchronous execution
- Deterministic schedules
- Designed-for-scale interconnects
This reduces:
- Load imbalance
- Communication overhead
- Software complexity
But at the cost of:
- Flexibility
- Support for heterogeneous workloads
Chapter 6 Takeaway
At scale, performance is determined by how efficiently chips talk to each other — not how fast they compute.
Deep References 🔍
🟢 Conceptual
- “How to Scale Your Model”
- Scaling laws literature
🟡 Architecture
- Megatron-LM
- TPU Pod architecture
- NVLink & NCCL docs
🔴 Hardware
- Interconnect PHY design
- Topology-aware routing papers