Friday, December 26, 2025

Why MoE breaks the systolic dream

 

1. Why MoE breaks the systolic dream (recap, visually)

Dense model on TPU (happy path)

  • Tokens flow layer → layer

  • Nearest-neighbor communication

  • Torus topology works beautifully

  • Predictable, local traffic

MoE model (the problem)

  • Router decides expert assignment at runtime

  • Tokens scatter to experts across chips

  • Requires all-to-all communication

  • Traffic ignores torus locality

This violates every assumption systolic arrays are built on:

  • Static schedules

  • Local movement

  • Deterministic timing
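
To see the scatter concretely, here is a minimal NumPy sketch (chip count, expert count, and gating weights are all made up for illustration) of top-2 routing and the cross-chip traffic it generates:

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model = 512, 1024
n_experts, n_chips = 64, 8
experts_per_chip = n_experts // n_chips       # expert e lives on chip e // 8

x = rng.standard_normal((n_tokens, d_model))
w_gate = rng.standard_normal((d_model, n_experts))

# The router is a small matmul plus top-k -- but it runs at runtime,
# so no destination is known until the token arrives here.
logits = x @ w_gate
top2 = np.argsort(logits, axis=-1)[:, -2:]          # (n_tokens, 2) expert ids

src_chip = rng.integers(0, n_chips, size=n_tokens)  # where each token starts
dst_chip = top2 // experts_per_chip                 # where its experts live

cross = (dst_chip != src_chip[:, None]).mean()
print(f"(token, expert) pairs leaving their chip: {cross:.2%}")
# With near-uniform routing this approaches 1 - 1/n_chips = 87.5%:
# almost all traffic crosses the fabric, ignoring torus locality.
```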


2. What TPUs actually do today (the workarounds)

2.1 Router on vector units (escape hatch)

  • Routing logic runs on VPUs (vector units), not MXUs (systolic arrays)

  • This is instruction-driven, not dataflow

  • You momentarily abandon the systolic world

👉 This is already an admission:

MoE doesn’t fit cleanly into pure dataflow.
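
For a sense of why the router lands on vector units, here is a sketch of typical top-k softmax gating (an assumed formulation, not TPU source code): the matmul is tiny, and everything after it is data-dependent select-and-normalize work.

```python
import numpy as np

def route(x, w_gate, k=2):
    """Top-k softmax gating: one tiny matmul, then sort-and-select.

    The matmul could feed a systolic array, but the argsort,
    gather, and renormalize steps are branchy scalar-vector work --
    exactly what an instruction-driven unit (the VPU) is for.
    """
    logits = x @ w_gate                          # (tokens, experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # data-dependent selection
    gates = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(gates - gates.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)        # softmax over the chosen k
    return topk, gates

# Example: 512 tokens, 64 experts, random weights.
rng = np.random.default_rng(0)
experts, gates = route(rng.standard_normal((512, 1024)),
                       rng.standard_normal((1024, 64)))
```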


2.2 Expert replication (reduce communication)

Instead of:

  • Each expert lives on one chip

They do:

  • Replicate popular experts across chips

Tradeoff

  • ✅ Less all-to-all traffic

  • ❌ More memory usage

  • ❌ Less parameter efficiency

  • ❌ Breaks MoE’s original scaling promise

This is a bandwidth-for-memory swap.
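
A back-of-envelope sketch of that swap, with all sizes and the popularity skew assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

n_experts, n_chips, n_tokens = 64, 8, 4096
expert_bytes = 2 * 1024 * 4096 * 2        # assumed: two 1024x4096 matrices, bf16

# Skewed popularity: a few "hot" experts receive most of the traffic.
popularity = rng.zipf(1.5, n_tokens) % n_experts

# Replicating the top-4 hot experts on every chip removes their
# cross-chip traffic but multiplies their memory footprint by n_chips.
hot = np.argsort(np.bincount(popularity, minlength=n_experts))[-4:]
local_fraction = np.isin(popularity, hot).mean()
extra_mem_gb = len(hot) * (n_chips - 1) * expert_bytes / 1e9

print(f"traffic made local: {local_fraction:.0%}, "
      f"extra memory: {extra_mem_gb:.1f} GB across the pod")
```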


2.3 Grouped / local routing (constrain entropy)

  • Tokens can only choose experts within a group

  • Groups align with topology neighborhoods

Result

  • Traffic becomes mostly local

  • But routing quality degrades

  • Experts become underutilized

This turns MoE into:

“Dense-with-extra-steps”
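
A minimal sketch of the constraint, assuming eight experts per chip and random scores: each token may only pick among the experts co-located with it.

```python
import numpy as np

def grouped_route(logits, src_chip, experts_per_chip, k=2):
    """Top-k restricted to the experts on the token's own chip.

    Traffic is local by construction, but a token whose best
    global expert lives elsewhere is forced onto a worse one.
    """
    n_experts = logits.shape[1]
    chip_of = np.arange(n_experts) // experts_per_chip      # expert -> chip
    masked = np.where(chip_of[None, :] == src_chip[:, None],
                      logits, -np.inf)                      # hide remote experts
    return np.argsort(masked, axis=-1)[:, -k:]

rng = np.random.default_rng(2)
logits = rng.standard_normal((512, 64))
src = rng.integers(0, 8, size=512)
local_choice = grouped_route(logits, src, experts_per_chip=8)
global_best = np.argmax(logits, axis=-1)
denied = (global_best // 8 != src).mean()
print(f"tokens whose globally best expert is off-chip: {denied:.0%}")
```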


2.4 Overprovision bandwidth and eat the cost

TPU pods throw hardware at the problem:

  • Massive bisection bandwidth

  • Custom optical interconnects

This works at Google scale, but:

  • 💸 Very expensive

  • ❌ Not elegant

  • ❌ Not scalable for customers
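
A rough sense of the volume involved, with batch size, hidden width, and dtype all assumed rather than taken from any real pod:

```python
# Back-of-envelope all-to-all volume for ONE MoE layer (shapes assumed):
tokens, d_model, top_k, bytes_per = 8192, 8192, 2, 2   # bf16 activations

dispatch = tokens * top_k * d_model * bytes_per        # scatter to experts
combine = dispatch                                     # gather results back
print(f"~{(dispatch + combine) / 1e9:.1f} GB crossing the fabric per layer")
# ~0.5 GB per MoE layer, per forward pass; multiply by dozens of layers
# per step and the required bisection bandwidth dwarfs what
# nearest-neighbor torus links were sized for.
```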


3. The uncomfortable truth

TPUs did not eliminate the mismatch.
They route around it by temporarily leaving the dataflow model.

In other words:

Phase           | Execution model
----------------|--------------------------
Dense layers    | Systolic dataflow (MXU)
Router          | Instruction-driven (VPU)
Expert dispatch | Network collectives
Expert compute  | Systolic again

This is hybrid execution, not pure dataflow.
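
To make the hybrid concrete, here is a single-chip NumPy emulation of one MoE layer with the collectives replaced by permutations (top-1 routing and toy shapes, assumed for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)
n_tokens, d, n_experts = 256, 64, 8

x = rng.standard_normal((n_tokens, d))
w_gate = rng.standard_normal((d, n_experts))
w_expert = rng.standard_normal((n_experts, d, d))

# Phase 1: dense layer -- static shapes, systolic-friendly (MXU).
h = np.tanh(x @ rng.standard_normal((d, d)))

# Phase 2: router -- data-dependent selection, instruction-driven (VPU).
top1 = np.argmax(h @ w_gate, axis=-1)           # (n_tokens,) expert ids

# Phase 3: dispatch -- on a pod this is an all-to-all collective;
# here, a permutation that groups tokens by destination expert.
order = np.argsort(top1)
h_shuffled, owners = h[order], top1[order]

# Phase 4: expert compute -- dense matmuls again, systolic-friendly.
y = np.empty_like(h_shuffled)
for e in range(n_experts):
    sel = owners == e                           # ragged, skewed group sizes
    y[sel] = h_shuffled[sel] @ w_expert[e]

# Phase 5: combine -- the inverse collective restores token order.
out = np.empty_like(y)
out[order] = y
```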


4. Why this matters for Nvidia / Groq-style inference

Here’s the key contrast:

TPU philosophy

  • Dataflow until it breaks

  • Then fall back to vector units + collectives

Groq-style philosophy

  • If it’s dynamic, don’t pretend it’s dataflow

  • Make routing explicit

  • Schedule communication statically when possible

  • Optimize for inference, not training

And Nvidia:

  • Owns NVLink

  • Owns collective primitives

  • Owns hybrid execution already

Meaning:

Nvidia is structurally better positioned for MoE-heavy inference than pure systolic designs.


5. Why MoE is fundamentally hostile to systolic arrays

This is the core reason no one has “solved” it:

Systolic assumption | MoE reality
--------------------|---------------------
Fixed data paths    | Dynamic routing
Local communication | Global shuffles
Static schedule     | Runtime decisions
Uniform compute     | Sparse, skewed load

You can mitigate each — but not eliminate all simultaneously.


6. Bottom line (very precise)

TPUs have not found a clean architectural solution to MoE.
They rely on hybrid execution, expert replication, constrained routing, and brute-force bandwidth — all of which erode the original benefits of systolic dataflow.

That’s why:

  • MoE works best when inference dominates

  • With explicit scheduling

  • And fabric-first designs (NVLink-style) instead of topology-first ones (torus)


One-sentence takeaway

MoE doesn’t break TPUs — it forces them to stop being TPUs for part of the time.


Open questions for a follow-up post:

  • Why MoE inference ≠ MoE training

  • Whether MoE survives long-context models

  • How NVLink-style fabrics change MoE economics

  • Why dense models may come back
