Monday, December 22, 2025

Chapter 6 Scaling Many Chips: When Communication Dominates

The Core Question

What happens when the model no longer fits on one chip?

Answer:

The bottleneck moves from memory to communication.


6.1 Why Scaling Breaks So Suddenly

Single-chip optimization assumes:

  • Fast local memory

  • High reuse

  • Predictable access

Multi-chip systems introduce:

  • Long wires

  • Serialization

  • Synchronization

Communication costs:

  • Scale with model size

  • Do not shrink when compute gets faster

This is why scaling efficiency collapses.
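
A rough cost model makes the mismatch concrete. This is a minimal sketch: the model size, FLOP count, and hardware numbers below are illustrative assumptions, not measurements.

# Back-of-envelope step time for synchronous data-parallel training.
# All hardware numbers are assumptions chosen for illustration.

PARAMS = 10e9                          # 10B-parameter model (assumed)
TOKENS_PER_CHIP = 2048                 # tokens processed per chip per step (assumed)
FLOPS_PER_STEP = 6 * PARAMS * TOKENS_PER_CHIP   # ~6 FLOPs per parameter per token
GRAD_BYTES = PARAMS * 2                # bf16 gradients

PEAK_FLOPS = 300e12                    # 300 TFLOP/s per chip (assumed)
NET_BW = 100e9                         # 100 GB/s effective AllReduce bandwidth (assumed)

t_comm = GRAD_BYTES / NET_BW           # lower bound: every gradient byte crosses a link

for speedup in (1, 2, 4, 8):           # pretend the chip's compute gets 2x, 4x, 8x faster
    t_compute = FLOPS_PER_STEP / (PEAK_FLOPS * speedup)
    step = t_compute + t_comm          # worst case: no overlap
    print(f"compute x{speedup}: step {step:.2f}s, "
          f"{t_comm / step:.0%} of it spent communicating")

Making the compute 8x faster barely changes the step time; the communication term is untouched, which is the collapse described above.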


6.2 The Three Parallelism Strategies (Precisely)

1. Data Parallelism

  • Each chip has full model

  • Different data batches

  • Gradients are synchronized

Cost: AllReduce on gradients
Failure mode: Communication dominates
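
Counting bytes shows why the AllReduce is the cost. A minimal sketch, with the parameter counts and gradient precision as assumed inputs:

# Gradient traffic generated by one synchronous data-parallel step.
# A ring AllReduce moves roughly 2x the payload per participant.

def grad_sync_bytes(num_params: float, bytes_per_grad: int = 2) -> float:
    """Approximate bytes each chip sends + receives per optimizer step."""
    return 2 * num_params * bytes_per_grad

for p in (1e9, 10e9, 70e9):            # assumed model sizes, for illustration
    gb = grad_sync_bytes(p) / 1e9
    print(f"{p / 1e9:>4.0f}B params -> ~{gb:,.0f} GB of gradient traffic per chip per step")

The traffic is proportional to parameter count and independent of how small the per-chip batch gets, so adding chips never reduces it.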


2. Model Parallelism

  • Model split across chips

  • Activations communicated

Cost: Activation traffic
Failure mode: Latency and imbalance
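
Here the traffic is activations rather than gradients, so it scales with batch size, sequence length, and hidden width. A sketch under assumed shapes:

# Activation bytes crossing one partition boundary in layer-wise model parallelism.
# The shapes are assumptions chosen for illustration.

def boundary_bytes(batch: int, seq_len: int, hidden: int, bytes_per_act: int = 2) -> int:
    """Bytes sent across one boundary per forward pass; the backward pass
    sends a same-sized gradient back the other way."""
    return batch * seq_len * hidden * bytes_per_act

mb = boundary_bytes(batch=8, seq_len=4096, hidden=8192) / 1e6
print(f"~{mb:.0f} MB per boundary, per direction, per step")

Unlike a gradient AllReduce, this transfer sits on the critical path of every forward and backward pass, which is where the latency and imbalance problems come from.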


3. Pipeline Parallelism

  • Layers assigned to stages

  • Micro-batching

Cost: Bubbles and scheduling
Failure mode: Utilization loss
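
The utilization loss has a simple closed form for a GPipe-style schedule: with S stages and M micro-batches, the idle ("bubble") fraction is (S - 1) / (M + S - 1). A quick check of that formula:

# Pipeline bubble fraction for a GPipe-style schedule.

def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"8 stages, {m:>2} micro-batches -> {bubble_fraction(8, m):.0%} idle")

Shrinking the bubble takes many micro-batches, which makes each micro-batch smaller and pushes the pressure back onto per-chip efficiency.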

No strategy avoids communication; each one just moves it.


6.3 Why AllReduce Is the Silent Killer

AllReduce:

  • Touches every parameter

  • Requires synchronization

  • Is bandwidth-bound

Even with perfect compute/communication overlap:

  • Gradient traffic grows linearly with parameter count

  • Per-chip compute shrinks as the work is split across more chips, so the communication fraction keeps rising

This is why:

  • Faster interconnects matter more than faster ALUs

  • Topology matters

  • Network-aware scheduling exists
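
The standard ring AllReduce makes the bandwidth argument explicit: each of N chips sends and receives about 2(N - 1)/N times the gradient size, so the time is set almost entirely by payload and link bandwidth, not by chip count. A sketch, with the bandwidth figure as an assumption:

# Ring AllReduce time model: t ≈ 2 * (N - 1) / N * bytes / bandwidth.

def ring_allreduce_seconds(num_bytes: float, num_chips: int, bw_bytes_per_s: float) -> float:
    return 2 * (num_chips - 1) / num_chips * num_bytes / bw_bytes_per_s

GRAD_BYTES = 10e9 * 2                  # 10B params in bf16 (assumed)
for n in (8, 64, 512):
    t = ring_allreduce_seconds(GRAD_BYTES, n, bw_bytes_per_s=100e9)
    print(f"{n:>3} chips -> {t * 1e3:.0f} ms per AllReduce")

Adding chips barely moves the number; only a faster interconnect or fewer bytes on the wire does, which is why the remedies above are all about the network.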


6.4 Scaling Efficiency (Why It Plateaus)

Two regimes:

  • Strong scaling: fixed problem size → diminishing returns

  • Weak scaling: fixed work per chip → efficiency holds up far better

Most ML training wants strong scaling — and pays the price.
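
A toy efficiency model shows the gap between the two regimes. Assume the per-step communication time is roughly constant (a bandwidth-bound AllReduce of a fixed-size model); the compute numbers are illustrative units, not measurements.

# Toy scaling-efficiency model under a constant per-step communication cost.

W = 100.0        # time to run the whole problem's compute on ONE chip (assumed units)
T_COMM = 0.5     # communication time per step (assumed ~independent of chip count)

def strong_scaling_efficiency(n: int) -> float:
    """Fixed total problem: each chip does W / n of compute plus the same comm."""
    return W / (n * (W / n + T_COMM))

def weak_scaling_efficiency() -> float:
    """Fixed work per chip: compute stays W per chip, comm stays T_COMM."""
    return W / (W + T_COMM)

for n in (1, 8, 64, 512, 4096):
    print(f"{n:>5} chips: strong {strong_scaling_efficiency(n):.1%}, "
          f"weak {weak_scaling_efficiency():.1%}")

Strong scaling hits the wall because the fixed communication term eventually dwarfs the shrinking per-chip compute; weak scaling avoids it only by growing the total work, which in training usually means growing the global batch.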


6.5 Why TPUs Scale Differently

TPUs benefit from:

  • Synchronous execution

  • Deterministic schedules

  • Designed-for-scale interconnects

This reduces:

  • Load imbalance

  • Communication overhead

  • Software complexity

But at the cost of:

  • Flexibility

  • Heterogeneous workloads


Chapter 6 Takeaway

At scale, performance is determined by how efficiently chips talk to each other — not how fast they compute.


Deep References 🔍

🟢 Conceptual

  • “How to Scale Your Model”

  • Scaling laws literature

🟡 Architecture

  • Megatron-LM

  • TPU Pod architecture

  • NVLink & NCCL docs

🔴 Hardware

  • Interconnect PHY design

  • Topology-aware routing papers

