1. Why MoE breaks the systolic dream (recap, visually)
Dense model on TPU (happy path)
- Tokens flow layer → layer
- Nearest-neighbor communication
- Torus topology works beautifully
- Predictable, local traffic
MoE model (the problem)
- Router decides expert assignment at runtime (sketched below)
- Tokens scatter to experts across chips
- Requires all-to-all communication
- Traffic ignores torus locality
This violates every assumption systolic arrays are built on:
- Static schedules
- Local movement
- Deterministic timing
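To make the "decided at runtime" point concrete, here is a minimal top-1 router in JAX (purely illustrative: the shapes, random weights, and top-1 choice are assumptions, not a real TPU MoE implementation). The expert each token lands on is a function of the token's values, so no compiler can know the communication pattern in advance.

```python
import jax
import jax.numpy as jnp

def route_top1(tokens, router_w):
    """Assign each token to one expert based on its contents.

    tokens:   [num_tokens, d_model]
    router_w: [d_model, num_experts]
    Returns a per-token expert id that is only known at runtime.
    """
    logits = tokens @ router_w                # [num_tokens, num_experts]
    expert_id = jnp.argmax(logits, axis=-1)   # data-dependent: different every batch
    return expert_id

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
tokens = jax.random.normal(k1, (8, 16))       # 8 tokens, d_model = 16
router_w = jax.random.normal(k2, (16, 4))     # 4 experts

# On a multi-chip system, token i must now travel to whichever chip hosts
# expert_id[i]: an all-to-all shuffle, not a nearest-neighbor hop.
print(route_top1(tokens, router_w))           # an array of expert ids, unknowable before the data arrives
```

Everything downstream of `expert_id` (which chip receives which token) is data-dependent, which is exactly what a static systolic schedule cannot express.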
2. What TPUs actually do today (the workarounds)
2.1 Router on vector units (escape hatch)
- Routing logic runs on the VPUs, not the MXUs (the systolic arrays)
- This is instruction-driven, not dataflow
- You momentarily abandon the systolic world
👉 This is already an admission: MoE doesn’t fit cleanly into pure dataflow.
2.2 Expert replication (reduce communication)
Instead of each expert living on exactly one chip, popular experts are replicated across chips.
Tradeoff:
- ✅ Less all-to-all traffic
- ❌ More memory usage
- ❌ Less parameter efficiency
- ❌ Breaks MoE’s original scaling promise
This is a bandwidth-for-memory swap.
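A back-of-the-envelope view of the memory side of that swap (every number below is an illustrative assumption, not a real TPU or model configuration):

```python
# Hypothetical 64-expert MoE layer in bf16, sharded across 16 chips.
num_experts        = 64
params_per_expert  = 50_000_000      # assumed expert size
bytes_per_param    = 2               # bf16
num_chips          = 16
replicated_experts = 8               # "popular" experts copied onto every chip

base_per_chip  = (num_experts / num_chips) * params_per_expert * bytes_per_param
extra_per_chip = replicated_experts * params_per_expert * bytes_per_param

print(f"per-chip expert memory, no replication:   {base_per_chip / 1e9:.1f} GB")
print(f"per-chip expert memory, with replication: {(base_per_chip + extra_per_chip) / 1e9:.1f} GB")
# In this sketch, replication triples per-chip expert memory in exchange for
# keeping more routing decisions local.
```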
2.3 Grouped / local routing (constrain entropy)
- Tokens can only choose experts within a group (sketched below)
- Groups align with topology neighborhoods
Result:
- Traffic becomes mostly local
- But routing quality degrades
- Experts become underutilized
This turns MoE into “dense-with-extra-steps.”
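A sketch of what grouped routing does to the router from section 1 (illustrative again: the group sizes, the masking scheme, and the top-1 choice are assumptions):

```python
import jax
import jax.numpy as jnp

def route_top1_grouped(tokens, router_w, group_id, experts_per_group):
    """Restrict each token's choice to experts in its own topology group.

    tokens:   [num_tokens, d_model]
    router_w: [d_model, num_experts]
    group_id: [num_tokens] -- which neighborhood each token's chip sits in
    """
    num_experts = router_w.shape[1]
    logits = tokens @ router_w                                    # [T, E]
    expert_group = jnp.arange(num_experts) // experts_per_group   # [E]
    # Mask out experts outside the token's group: traffic stays local,
    # but tokens can no longer pick the globally best expert.
    allowed = expert_group[None, :] == group_id[:, None]          # [T, E]
    masked = jnp.where(allowed, logits, -jnp.inf)
    return jnp.argmax(masked, axis=-1)

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
tokens = jax.random.normal(k1, (8, 16))
router_w = jax.random.normal(k2, (16, 8))        # 8 experts, 2 groups of 4
group_id = jnp.array([0, 0, 0, 0, 1, 1, 1, 1])   # tokens pinned to their chip's group
print(route_top1_grouped(tokens, router_w, group_id, experts_per_group=4))
```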
2.4 Overprovision bandwidth and eat the cost
TPU pods throw:
- Massive bisection bandwidth
- Custom optical interconnects
at the problem.
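A rough sense of why the bandwidth bill is large: the all-to-all volume for a single MoE layer, with every number below an illustrative assumption rather than a measured TPU figure.

```python
# Illustrative, not measured: all-to-all bytes for one MoE layer at inference.
tokens_in_flight = 32_768          # assumed batch x sequence slice in flight
d_model          = 8_192
bytes_per_elem   = 2               # bf16 activations
num_chips        = 64

# Each token's activation is sent to its expert's chip and the result sent back.
# With uniform routing, a (1 - 1/num_chips) fraction of tokens leave their chip.
off_chip_fraction = 1 - 1 / num_chips
bytes_per_layer = 2 * tokens_in_flight * d_model * bytes_per_elem * off_chip_fraction

print(f"~{bytes_per_layer / 1e9:.1f} GB shuffled per MoE layer")
# Multiply by dozens of MoE layers under a tight latency budget, and the
# interconnect has to be provisioned for this worst case at all times.
```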
This works at Google scale, but:
- 💸 Very expensive
- ❌ Not elegant
- ❌ Not scalable for customers
3. The uncomfortable truth
TPUs did not eliminate the mismatch.
They route around it by temporarily leaving the dataflow model.
In other words:
| Phase | Execution model |
|---|---|
| Dense layers | Systolic dataflow (MXU) |
| Router | Instruction-driven (VPU) |
| Expert dispatch | Network collectives |
| Expert compute | Systolic again |
This is hybrid execution, not pure dataflow.
4. Why this matters for Nvidia / Groq-style inference
Here’s the key contrast:
TPU philosophy
- Dataflow until it breaks
- Then fall back to vector units + collectives
Groq-style philosophy
- If it’s dynamic, don’t pretend it’s dataflow
- Make routing explicit
- Schedule communication statically when possible (see the sketch at the end of this section)
- Optimize for inference, not training
And Nvidia:
- Owns NVLink
- Owns collective primitives
- Owns hybrid execution already
Meaning:
Nvidia is structurally better positioned for MoE-heavy inference than pure systolic designs.
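One standard way to “schedule communication statically”, referenced in the Groq-style list above, is fixed expert capacity: give every expert a fixed-size receive buffer, drop or pad overflow, and dispatch becomes a fixed-shape transfer a compiler can plan around. A minimal JAX sketch, with the capacity value and drop policy as assumptions:

```python
import jax
import jax.numpy as jnp

def dispatch_fixed_capacity(tokens, expert_id, num_experts, capacity):
    """Pack tokens into fixed-size per-expert buffers.

    The buffer shape [num_experts, capacity, d_model] is known at compile time,
    so the resulting transfers have a static shape; tokens past capacity drop.
    """
    d_model = tokens.shape[-1]
    # Position of each token within its expert's buffer (0, 1, 2, ... per expert).
    onehot = jax.nn.one_hot(expert_id, num_experts, dtype=jnp.int32)   # [T, E]
    pos_in_expert = jnp.cumsum(onehot, axis=0) * onehot                 # [T, E]
    position = jnp.sum(pos_in_expert, axis=-1) - 1                      # [T]
    keep = position < capacity                                           # overflow tokens dropped

    buffers = jnp.zeros((num_experts, capacity, d_model), tokens.dtype)
    buffers = buffers.at[expert_id, jnp.clip(position, 0, capacity - 1)].add(
        jnp.where(keep[:, None], tokens, 0.0)
    )
    return buffers, keep

key = jax.random.PRNGKey(0)
tokens = jax.random.normal(key, (8, 16))
expert_id = jnp.array([0, 0, 0, 1, 1, 2, 3, 3])      # runtime routing decision
buffers, keep = dispatch_fixed_capacity(tokens, expert_id, num_experts=4, capacity=2)
print(buffers.shape)   # (4, 2, 16) -- fixed regardless of how routing skews
print(keep)            # the third token routed to expert 0 is dropped
```

The price of the static shape is dropped or padded tokens whenever routing skews, which is why load balancing matters so much in these designs.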
5. Why MoE is fundamentally hostile to systolic arrays
This is the core reason no one has “solved” it:
| Systolic assumption | MoE reality |
|---|---|
| Fixed data paths | Dynamic routing |
| Local communication | Global shuffles |
| Static schedule | Runtime decisions |
| Uniform compute | Sparse, skewed load |
You can mitigate each — but not eliminate all simultaneously.
6. Bottom line (very precise)
TPUs have not found a clean architectural solution to MoE.
They rely on hybrid execution, expert replication, constrained routing, and brute-force bandwidth — all of which erode the original benefits of systolic dataflow.
That’s why:
- MoE works best when inference dominates
- With explicit scheduling
- And fabric-first designs (NVLink-style) instead of topology-first ones (torus)
One-sentence takeaway
MoE doesn’t break TPUs — it forces them to stop being TPUs for part of the time.
Possible follow-ups:
- Why MoE inference ≠ MoE training
- Whether MoE survives long-context models
- How NVLink-style fabrics change MoE economics
- Or why dense models may come back