Friday, December 26, 2025

TPU vs GPU: the difference at the microarchitecture level


1. GPU execution pipeline (why flexibility costs efficiency)

What a GPU is optimizing for

  • Many different programs

  • Many different access patterns

  • Many different users

What actually happens during a matmul on a GPU

  1. Instructions are fetched and decoded

  2. Warps are scheduled dynamically

  3. Data is pulled from HBM → L2 → shared memory

  4. Tensor Cores execute tiles

  5. Partial results are written back

  6. Synchronization barriers occur

  7. Repeat per kernel launch

Even with Tensor Cores:

  • Execution is instruction-driven

  • Memory is demand-fetched

  • Caches guess what you’ll need

🧠 Key point
The GPU is constantly deciding what to do next.

That decision logic costs:

  • Power

  • Area

  • Latency

  • Determinism
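
To make the numbered steps above concrete, here is a minimal NumPy sketch (not real CUDA) that mirrors the GPU flow: tiles are demand-fetched, a Tensor-Core-like tile multiply runs, partial results are written back, and a barrier sits between iterations. The tile size and buffer names are illustrative assumptions.

```python
# Illustrative NumPy sketch of the GPU-style flow above (not real CUDA).
import numpy as np

TILE = 64  # stand-in for a thread-block tile; an arbitrary illustrative size

def gpu_style_matmul(A, B):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)                    # result lives in "HBM"
    for i in range(0, M, TILE):                            # each (i, j) tile ~ one block of warps
        for j in range(0, N, TILE):
            acc = np.zeros((TILE, TILE), dtype=A.dtype)    # registers / shared memory
            for k in range(0, K, TILE):
                a_tile = A[i:i+TILE, k:k+TILE].copy()      # step 3: demand-fetch HBM -> L2 -> shared
                b_tile = B[k:k+TILE, j:j+TILE].copy()
                acc += a_tile @ b_tile                     # step 4: Tensor-Core-style tile MMA
                # step 6: implicit barrier before the next tiles are loaded
            C[i:i+TILE, j:j+TILE] = acc                    # step 5: write partial result back
    return C

A = np.random.rand(256, 128).astype(np.float32)
B = np.random.rand(128, 256).astype(np.float32)
assert np.allclose(gpu_style_matmul(A, B), A @ B, atol=1e-2)
```

Every line inside the loop is either a fetch, a write-back, or a sync point: that is the control and memory machinery the key point above is talking about.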


2. TPU execution pipeline (why it’s more efficient)

TPU has three main blocks

  • MXU – Matrix Multiply Unit (the systolic array)

  • VPU – Vector Processing Unit (activations and other non-matmul ops)

  • SRAM – large on-chip scratchpads

What happens during a matmul on a TPU

  1. Data is preloaded into SRAM

  2. A and B values stream into the MXU

  3. Multiply-accumulate happens every cycle

  4. Partial sums stay local

  5. Results stream out — no cache, no stalls

No instruction fetch.
No dynamic scheduling.
No cache misses.

🧠 Key point
The TPU doesn’t decide — it just flows.
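
The same matmul, written to mirror the five steps above (again plain Python for illustration, not real TPU code): operands are loaded once, data streams through a multiply-accumulate loop, partial sums stay local, and nothing is refetched or written back mid-loop.

```python
# Illustrative sketch of the TPU-style dataflow above.
import numpy as np

def tpu_style_matmul(A, B):
    M, K = A.shape
    _, N = B.shape
    W = B.copy()                            # step 1: operands preloaded once into "SRAM"
    C = np.zeros((M, N), dtype=A.dtype)
    for m in range(M):                      # step 2: activations stream into the array
        acc = np.zeros(N, dtype=A.dtype)    # step 4: partial sums stay local
        for k in range(K):
            acc += A[m, k] * W[k]           # step 3: a multiply-accumulate wavefront per "cycle"
        C[m] = acc                          # step 5: finished results stream out
    return C                                # no refetching, no write-back inside the loop

A = np.random.rand(64, 32).astype(np.float32)
B = np.random.rand(32, 48).astype(np.float32)
assert np.allclose(tpu_style_matmul(A, B), A @ B, atol=1e-3)
```

The contrast with the GPU sketch is the whole point: the inner loop here is nothing but multiply-accumulates.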


3. Systolic array vs Tensor Core (the heart of the difference)

Tensor Core (GPU)

  • Executes one matrix-multiply-accumulate (MMA) instruction per small tile

  • Needs registers + shared memory

  • Controlled by warps

  • Fed by caches

Systolic array (TPU)

  • Thousands of MACs wired together

  • Data pulses through like a circuit

  • Every cycle does useful work

  • No control overhead per operation

📌 Why TPUs win

  • Higher utilization

  • Lower energy per MAC

  • No instruction overhead per multiply
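
A back-of-envelope way to see the "no instruction overhead per multiply" point. The shapes below are assumptions for illustration: a 128×128 MXU (the size commonly cited for recent TPU generations) and the 16×16×16 tile exposed by CUDA's warp-level WMMA API.

```python
# Back-of-envelope instruction/MAC accounting; shapes are illustrative assumptions.
MXU_DIM = 128
macs_per_cycle_mxu = MXU_DIM * MXU_DIM       # 16,384 MACs per cycle, no instructions issued

WMMA_M, WMMA_N, WMMA_K = 16, 16, 16          # one warp-level WMMA tile
macs_per_mma = WMMA_M * WMMA_N * WMMA_K      # 4,096 MACs per issued MMA instruction

total_macs = 1024 ** 3                       # a 1024 x 1024 x 1024 matmul
print("MXU cycles:      ", total_macs // macs_per_cycle_mxu)  # 65,536 cycles of pure dataflow
print("MMA instructions:", total_macs // macs_per_mma)        # 262,144 instructions to fetch,
                                                              # decode and schedule across warps
```

This counts control events, not speed; it just shows where the per-operation overhead comes from.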


4. Compiler role: XLA vs CUDA

TPU + XLA

  • Model is fully compiled

  • Memory addresses are fixed

  • Execution timing is fixed

  • Data movement is explicit

GPU + CUDA

  • Kernels launched dynamically

  • Memory accessed at runtime

  • Scheduling is reactive

  • Caches hide uncertainty

🧠 Rule of thumb

Compilation beats speculation when the workload is predictable.

Transformers are predictable.
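
A minimal JAX sketch of the XLA side (JAX compiles through XLA on CPU, GPU and TPU backends). The layer below is a made-up example; the point is that the whole function is traced and compiled once for fixed shapes, then replayed.

```python
# XLA compiles the whole function for fixed shapes, then replays the static program.
import jax
import jax.numpy as jnp

def layer(x, w):
    return jax.nn.relu(x @ w)          # matmul + activation, fused by XLA

layer_jit = jax.jit(layer)             # trace + compile once per input shape/dtype

x = jnp.ones((128, 512), dtype=jnp.bfloat16)
w = jnp.ones((512, 512), dtype=jnp.bfloat16)

y = layer_jit(x, w)                    # first call: XLA builds a static schedule
y = layer_jit(x, w)                    # later calls: run the precompiled program, no re-deciding
```

Layout, fusion and scheduling are all decided at compile time; at run time the device just replays the plan, which is the "no deciding, just flowing" behaviour described above.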


5. Why TPUs dominate training throughput

Training characteristics:

  • Huge batch sizes

  • Dense matrix multiplies

  • Repeated many times

TPUs shine because:

  • Systolic arrays stay 90%+ utilized

  • SRAM reuse is extremely high

  • Power per FLOP is low

  • Torus interconnect matches the all-reduce patterns of dense training well

GPUs:

  • Can match TPUs on peak FLOPs

  • Rarely match them on sustained FLOPs

  • Burn more power on control logic and memory traffic
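
One way to see why huge batches favour a big matrix unit: a rough arithmetic-intensity estimate for a dense layer, assuming bf16 operands (2 bytes per element) and that each operand is moved once. The layer sizes are made up for illustration.

```python
# Rough arithmetic intensity (FLOPs per byte moved) for C = A @ B, bf16 operands assumed.
def arithmetic_intensity(M, K, N, bytes_per_elem=2):
    flops = 2 * M * K * N                                   # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)  # read A, read B, write C once
    return flops / bytes_moved

for batch in (8, 128, 4096):                                # tokens per step, illustrative
    ai = arithmetic_intensity(M=batch, K=8192, N=8192)
    print(f"batch {batch:5d}: ~{ai:,.0f} FLOPs per byte")
# Small batches are memory-bound; big training batches are compute-bound,
# which is exactly where a systolic array's utilization advantage shows up.
```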


6. Why GPUs still matter (and where TPUs lose)

GPUs win when:

  • Models are dynamic

  • Control flow is irregular

  • Ops are custom

  • Batch sizes are small

  • Latency matters more than throughput

TPUs struggle when:

  • Shapes change often

  • Routing is dynamic (MoE)

  • Workload is sparse

  • Short jobs dominate (compile cost)
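
To make the "shapes change often" and "compile cost" bullets concrete, here is a small JAX sketch (JAX compiles through XLA): every new input shape triggers a fresh trace and compile, and only repeated shapes reuse the cached executable. The function is a made-up example and the timings will vary by backend.

```python
# A jitted function is retraced and recompiled by XLA for every new input shape.
import time
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    return (x @ x.T).sum()

for n in (128, 256, 512):                      # three distinct shapes
    x = jnp.ones((n, 64))
    t0 = time.perf_counter()
    f(x).block_until_ready()                   # first call per shape pays trace + compile
    t1 = time.perf_counter()
    f(x).block_until_ready()                   # second call reuses the cached executable
    t2 = time.perf_counter()
    print(f"n={n}: first {t1 - t0:.3f}s, cached {t2 - t1:.5f}s")
```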

This is why:

  • Google uses TPUs internally

  • But still offers GPUs broadly

  • And why Nvidia still dominates the ecosystem


7. The one diagram that explains everything

GPU philosophy

Instruction → Decide → Fetch → Compute → Write → Repeat

TPU philosophy

Data → Flow → Compute → Flow → Done

Final takeaway (very tight)

TPUs do better than GPUs because they remove decision-making from the inner loop.
When the math is known in advance (transformers, dense training), dataflow beats instruction-driven execution every time — in speed, power, and scale.

That’s the entire story.

Still to cover in follow-up posts:

  • Why this advantage shrinks in inference

  • Why MoE breaks TPU elegance

  • Why Nvidia didn’t copy systolic arrays

  • Why TPUs are hard to sell outside Google
