1. Why MoE breaks the systolic dream (recap, visually)
Dense model on TPU (happy path)
- Tokens flow layer → layer
- Nearest-neighbor communication
- Torus topology works beautifully
- Predictable, local traffic
MoE model (the problem)
- Router decides expert assignment at runtime (sketched below)
- Tokens scatter to experts across chips
- Requires all-to-all communication
- Traffic ignores torus locality
This violates every assumption systolic arrays are built on:
- Static schedules
- Local movement
- Deterministic timing
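To make the "decided at runtime" point concrete, here is a minimal top-1 router in JAX (purely illustrative: the shapes, random weights, and top-1 choice are assumptions, not a real TPU MoE implementation). The expert each token lands on is a function of the token's values, so no compiler can know the communication pattern in advance.

```python
import jax
import jax.numpy as jnp

def route_top1(tokens, router_w):
    """Assign each token to one expert based on its contents.

    tokens:   [num_tokens, d_model]
    router_w: [d_model, num_experts]
    Returns a per-token expert id that is only known at runtime.
    """
    logits = tokens @ router_w                # [num_tokens, num_experts]
    expert_id = jnp.argmax(logits, axis=-1)   # data-dependent: different every batch
    return expert_id

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
tokens = jax.random.normal(k1, (8, 16))       # 8 tokens, d_model = 16
router_w = jax.random.normal(k2, (16, 4))     # 4 experts

# On a multi-chip system, token i must now travel to whichever chip hosts
# expert_id[i]: an all-to-all shuffle, not a nearest-neighbor hop.
print(route_top1(tokens, router_w))           # an array of expert ids, unknowable before the data arrives
```

Everything downstream of `expert_id` (which chip receives which token) is data-dependent, which is exactly what a static systolic schedule cannot express.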
2. What TPUs actually do today (the workarounds)
2.1 Router on vector units (escape hatch)
- Routing logic runs on the VPUs, not the MXUs (the systolic arrays)
- This is instruction-driven, not dataflow
- You momentarily abandon the systolic world
👉 This is already an admission: MoE doesn’t fit cleanly into pure dataflow.
2.2 Expert replication (reduce communication)
Instead of each expert living on exactly one chip, popular experts are replicated across chips.
Tradeoff:
- ✅ Less all-to-all traffic
- ❌ More memory usage
- ❌ Less parameter efficiency
- ❌ Breaks MoE’s original scaling promise
This is a bandwidth-for-memory swap.
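A back-of-the-envelope view of the memory side of that swap (every number below is an illustrative assumption, not a real TPU or model configuration):

```python
# Hypothetical 64-expert MoE layer in bf16, sharded across 16 chips.
num_experts        = 64
params_per_expert  = 50_000_000      # assumed expert size
bytes_per_param    = 2               # bf16
num_chips          = 16
replicated_experts = 8               # "popular" experts copied onto every chip

base_per_chip  = (num_experts / num_chips) * params_per_expert * bytes_per_param
extra_per_chip = replicated_experts * params_per_expert * bytes_per_param

print(f"per-chip expert memory, no replication:   {base_per_chip / 1e9:.1f} GB")
print(f"per-chip expert memory, with replication: {(base_per_chip + extra_per_chip) / 1e9:.1f} GB")
# In this sketch, replication triples per-chip expert memory in exchange for
# keeping more routing decisions local.
```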
2.3 Grouped / local routing (constrain entropy)
- Tokens can only choose experts within a group (sketched below)
- Groups align with topology neighborhoods
Result:
- Traffic becomes mostly local
- But routing quality degrades
- Experts become underutilized
This turns MoE into “dense-with-extra-steps.”
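A sketch of what grouped routing does to the router from section 1 (illustrative again: the group sizes, the masking scheme, and the top-1 choice are assumptions):

```python
import jax
import jax.numpy as jnp

def route_top1_grouped(tokens, router_w, group_id, experts_per_group):
    """Restrict each token's choice to experts in its own topology group.

    tokens:   [num_tokens, d_model]
    router_w: [d_model, num_experts]
    group_id: [num_tokens] -- which neighborhood each token's chip sits in
    """
    num_experts = router_w.shape[1]
    logits = tokens @ router_w                                    # [T, E]
    expert_group = jnp.arange(num_experts) // experts_per_group   # [E]
    # Mask out experts outside the token's group: traffic stays local,
    # but tokens can no longer pick the globally best expert.
    allowed = expert_group[None, :] == group_id[:, None]          # [T, E]
    masked = jnp.where(allowed, logits, -jnp.inf)
    return jnp.argmax(masked, axis=-1)

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
tokens = jax.random.normal(k1, (8, 16))
router_w = jax.random.normal(k2, (16, 8))        # 8 experts, 2 groups of 4
group_id = jnp.array([0, 0, 0, 0, 1, 1, 1, 1])   # tokens pinned to their chip's group
print(route_top1_grouped(tokens, router_w, group_id, experts_per_group=4))
```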
2.4 Overprovision bandwidth and eat the cost
TPU pods throw:
- Massive bisection bandwidth
- Custom optical interconnects
at the problem.
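A rough sense of why the bandwidth bill is large: the all-to-all volume for a single MoE layer, with every number below an illustrative assumption rather than a measured TPU figure.

```python
# Illustrative, not measured: all-to-all bytes for one MoE layer at inference.
tokens_in_flight = 32_768          # assumed batch x sequence slice in flight
d_model          = 8_192
bytes_per_elem   = 2               # bf16 activations
num_chips        = 64

# Each token's activation is sent to its expert's chip and the result sent back.
# With uniform routing, a (1 - 1/num_chips) fraction of tokens leave their chip.
off_chip_fraction = 1 - 1 / num_chips
bytes_per_layer = 2 * tokens_in_flight * d_model * bytes_per_elem * off_chip_fraction

print(f"~{bytes_per_layer / 1e9:.1f} GB shuffled per MoE layer")
# Multiply by dozens of MoE layers under a tight latency budget, and the
# interconnect has to be provisioned for this worst case at all times.
```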
This works at Google scale, but:
- 💸 Very expensive
- ❌ Not elegant
- ❌ Not scalable for customers
3. The uncomfortable truth
TPUs did not eliminate the mismatch.
They route around it by temporarily leaving the dataflow model.
In other words:
| Phase | Execution model |
|---|---|
| Dense layers | Systolic dataflow (MXU) |
| Router | Instruction-driven (VPU) |
| Expert dispatch | Network collectives |
| Expert compute | Systolic again |
This is hybrid execution, not pure dataflow.
4. Why this matters for Nvidia / Groq-style inference
Here’s the key contrast:
TPU philosophy
- Dataflow until it breaks
- Then fall back to vector units + collectives
Groq-style philosophy
- If it’s dynamic, don’t pretend it’s dataflow
- Make routing explicit
- Schedule communication statically when possible (see the sketch at the end of this section)
- Optimize for inference, not training
And Nvidia:
- Owns NVLink
- Owns collective primitives
- Owns hybrid execution already
Meaning:
Nvidia is structurally better positioned for MoE-heavy inference than pure systolic designs.
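One standard way to “schedule communication statically”, referenced in the Groq-style list above, is fixed expert capacity: give every expert a fixed-size receive buffer, drop or pad overflow, and dispatch becomes a fixed-shape transfer a compiler can plan around. A minimal JAX sketch, with the capacity value and drop policy as assumptions:

```python
import jax
import jax.numpy as jnp

def dispatch_fixed_capacity(tokens, expert_id, num_experts, capacity):
    """Pack tokens into fixed-size per-expert buffers.

    The buffer shape [num_experts, capacity, d_model] is known at compile time,
    so the resulting transfers have a static shape; tokens past capacity drop.
    """
    d_model = tokens.shape[-1]
    # Position of each token within its expert's buffer (0, 1, 2, ... per expert).
    onehot = jax.nn.one_hot(expert_id, num_experts, dtype=jnp.int32)   # [T, E]
    pos_in_expert = jnp.cumsum(onehot, axis=0) * onehot                 # [T, E]
    position = jnp.sum(pos_in_expert, axis=-1) - 1                      # [T]
    keep = position < capacity                                           # overflow tokens dropped

    buffers = jnp.zeros((num_experts, capacity, d_model), tokens.dtype)
    buffers = buffers.at[expert_id, jnp.clip(position, 0, capacity - 1)].add(
        jnp.where(keep[:, None], tokens, 0.0)
    )
    return buffers, keep

key = jax.random.PRNGKey(0)
tokens = jax.random.normal(key, (8, 16))
expert_id = jnp.array([0, 0, 0, 1, 1, 2, 3, 3])      # runtime routing decision
buffers, keep = dispatch_fixed_capacity(tokens, expert_id, num_experts=4, capacity=2)
print(buffers.shape)   # (4, 2, 16) -- fixed regardless of how routing skews
print(keep)            # the third token routed to expert 0 is dropped
```

The price of the static shape is dropped or padded tokens whenever routing skews, which is why load balancing matters so much in these designs.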
5. Why MoE is fundamentally hostile to systolic arrays
This is the core reason no one has “solved” it:
| Systolic assumption | MoE reality |
|---|---|
| Fixed data paths | Dynamic routing |
| Local communication | Global shuffles |
| Static schedule | Runtime decisions |
| Uniform compute | Sparse, skewed load |
You can mitigate each — but not eliminate all simultaneously.
6. Bottom line (very precise)
TPUs have not found a clean architectural solution to MoE.
They rely on hybrid execution, expert replication, constrained routing, and brute-force bandwidth — all of which erode the original benefits of systolic dataflow.
That’s why:
- MoE works best when inference dominates
- With explicit scheduling
- And fabric-first designs (NVLink-style) instead of topology-first ones (torus)
One-sentence takeaway
MoE doesn’t break TPUs — it forces them to stop being TPUs for part of the time.
Possible follow-ups:
- Why MoE inference ≠ MoE training
- Whether MoE survives long-context models
- How NVLink-style fabrics change MoE economics
- Or why dense models may come back