Monday, December 22, 2025

Chapter 6 Scaling Many Chips: When Communication Dominates

The Core Question

What happens when the model no longer fits on one chip?

Answer:

The bottleneck moves from memory to communication.


6.1 Why Scaling Breaks So Suddenly

Single-chip optimization assumes:

  • Fast local memory

  • High reuse

  • Predictable access

Multi-chip systems introduce:

  • Long wires

  • Serialization

  • Synchronization

Communication costs:

  • Scale with model size

  • Do not shrink when compute gets faster

This is why scaling efficiency collapses.
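
A rough cost model makes the mismatch concrete. This is a minimal sketch: the model size, FLOP count, and hardware numbers below are illustrative assumptions, not measurements.

# Back-of-envelope step time for synchronous data-parallel training.
# All hardware numbers are assumptions chosen for illustration.

PARAMS = 10e9                          # 10B-parameter model (assumed)
TOKENS_PER_CHIP = 2048                 # tokens processed per chip per step (assumed)
FLOPS_PER_STEP = 6 * PARAMS * TOKENS_PER_CHIP   # ~6 FLOPs per parameter per token
GRAD_BYTES = PARAMS * 2                # bf16 gradients

PEAK_FLOPS = 300e12                    # 300 TFLOP/s per chip (assumed)
NET_BW = 100e9                         # 100 GB/s effective AllReduce bandwidth (assumed)

t_comm = GRAD_BYTES / NET_BW           # lower bound: every gradient byte crosses a link

for speedup in (1, 2, 4, 8):           # pretend the chip's compute gets 2x, 4x, 8x faster
    t_compute = FLOPS_PER_STEP / (PEAK_FLOPS * speedup)
    step = t_compute + t_comm          # worst case: no overlap
    print(f"compute x{speedup}: step {step:.2f}s, "
          f"{t_comm / step:.0%} of it spent communicating")

Making the compute 8x faster barely changes the step time; the communication term is untouched, which is the collapse described above.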


6.2 The Three Parallelism Strategies (Precisely)

1. Data Parallelism

  • Each chip has full model

  • Different data batches

  • Gradients are synchronized

Cost: AllReduce on gradients
Failure mode: Communication dominates
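
Counting bytes shows why the AllReduce is the cost. A minimal sketch, with the parameter counts and gradient precision as assumed inputs:

# Gradient traffic generated by one synchronous data-parallel step.
# A ring AllReduce moves roughly 2x the payload per participant.

def grad_sync_bytes(num_params: float, bytes_per_grad: int = 2) -> float:
    """Approximate bytes each chip sends + receives per optimizer step."""
    return 2 * num_params * bytes_per_grad

for p in (1e9, 10e9, 70e9):            # assumed model sizes, for illustration
    gb = grad_sync_bytes(p) / 1e9
    print(f"{p / 1e9:>4.0f}B params -> ~{gb:,.0f} GB of gradient traffic per chip per step")

The traffic is proportional to parameter count and independent of how small the per-chip batch gets, so adding chips never reduces it.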


2. Model Parallelism

  • Model split across chips

  • Activations communicated

Cost: Activation traffic
Failure mode: Latency and imbalance
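
Here the traffic is activations rather than gradients, so it scales with batch size, sequence length, and hidden width. A sketch under assumed shapes:

# Activation bytes crossing one partition boundary in layer-wise model parallelism.
# The shapes are assumptions chosen for illustration.

def boundary_bytes(batch: int, seq_len: int, hidden: int, bytes_per_act: int = 2) -> int:
    """Bytes sent across one boundary per forward pass; the backward pass
    sends a same-sized gradient back the other way."""
    return batch * seq_len * hidden * bytes_per_act

mb = boundary_bytes(batch=8, seq_len=4096, hidden=8192) / 1e6
print(f"~{mb:.0f} MB per boundary, per direction, per step")

Unlike a gradient AllReduce, this transfer sits on the critical path of every forward and backward pass, which is where the latency and imbalance problems come from.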


3. Pipeline Parallelism

  • Layers assigned to stages

  • Micro-batching

Cost: Bubbles and scheduling
Failure mode: Utilization loss
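
The utilization loss has a simple closed form for a GPipe-style schedule: with S stages and M micro-batches, the idle ("bubble") fraction is (S - 1) / (M + S - 1). A quick check of that formula:

# Pipeline bubble fraction for a GPipe-style schedule.

def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"8 stages, {m:>2} micro-batches -> {bubble_fraction(8, m):.0%} idle")

Shrinking the bubble takes many micro-batches, which makes each micro-batch smaller and pushes the pressure back onto per-chip efficiency.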

No strategy avoids communication; each one just moves it.


6.3 Why AllReduce Is the Silent Killer

AllReduce:

  • Touches every parameter

  • Requires synchronization

  • Is bandwidth-bound

Even with perfect compute/communication overlap:

  • Gradient traffic grows linearly with parameter count

  • Per-chip compute shrinks as the work is split across more chips, so the communication fraction keeps rising

This is why:

  • Faster interconnects matter more than faster ALUs

  • Topology matters

  • Network-aware scheduling exists
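
The standard ring AllReduce makes the bandwidth argument explicit: each of N chips sends and receives about 2(N - 1)/N times the gradient size, so the time is set almost entirely by payload and link bandwidth, not by chip count. A sketch, with the bandwidth figure as an assumption:

# Ring AllReduce time model: t ≈ 2 * (N - 1) / N * bytes / bandwidth.

def ring_allreduce_seconds(num_bytes: float, num_chips: int, bw_bytes_per_s: float) -> float:
    return 2 * (num_chips - 1) / num_chips * num_bytes / bw_bytes_per_s

GRAD_BYTES = 10e9 * 2                  # 10B params in bf16 (assumed)
for n in (8, 64, 512):
    t = ring_allreduce_seconds(GRAD_BYTES, n, bw_bytes_per_s=100e9)
    print(f"{n:>3} chips -> {t * 1e3:.0f} ms per AllReduce")

Adding chips barely moves the number; only a faster interconnect or fewer bytes on the wire does, which is why the remedies above are all about the network.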


6.4 Scaling Efficiency (Why It Plateaus)

Two regimes:

  • Strong scaling: fixed problem size → diminishing returns

  • Weak scaling: fixed work per chip → efficiency holds up far better

Most ML training wants strong scaling — and pays the price.
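
A toy efficiency model shows the gap between the two regimes. Assume the per-step communication time is roughly constant (a bandwidth-bound AllReduce of a fixed-size model); the compute numbers are illustrative units, not measurements.

# Toy scaling-efficiency model under a constant per-step communication cost.

W = 100.0        # time to run the whole problem's compute on ONE chip (assumed units)
T_COMM = 0.5     # communication time per step (assumed ~independent of chip count)

def strong_scaling_efficiency(n: int) -> float:
    """Fixed total problem: each chip does W / n of compute plus the same comm."""
    return W / (n * (W / n + T_COMM))

def weak_scaling_efficiency() -> float:
    """Fixed work per chip: compute stays W per chip, comm stays T_COMM."""
    return W / (W + T_COMM)

for n in (1, 8, 64, 512, 4096):
    print(f"{n:>5} chips: strong {strong_scaling_efficiency(n):.1%}, "
          f"weak {weak_scaling_efficiency():.1%}")

Strong scaling hits the wall because the fixed communication term eventually dwarfs the shrinking per-chip compute; weak scaling avoids it only by growing the total work, which in training usually means growing the global batch.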


6.5 Why TPUs Scale Differently

TPUs benefit from:

  • Synchronous execution

  • Deterministic schedules

  • Designed-for-scale interconnects

This reduces:

  • Load imbalance

  • Communication overhead

  • Software complexity

But at the cost of:

  • Flexibility

  • Heterogeneous workloads


Chapter 6 Takeaway

At scale, performance is determined by how efficiently chips talk to each other — not how fast they compute.


Deep References 🔍

🟢 Conceptual

  • “How to Scale Your Model”

  • Scaling laws literature

🟡 Architecture

  • Megatron-LM

  • TPU Pod architecture

  • NVLink & NCCL docs

🔴 Hardware

  • Interconnect PHY design

  • Topology-aware routing papers

