1. GPU execution pipeline (why flexibility costs efficiency)
What a GPU is optimizing for
- Many different programs
- Many different access patterns
- Many different users
What actually happens during a matmul on a GPU
- Instructions are fetched and decoded
- Warps are scheduled dynamically
- Data is pulled from HBM → L2 → shared memory
- Tensor Cores execute tiles
- Partial results are written back
- Synchronization barriers occur
- Repeat per kernel launch
Even with Tensor Cores:
- Execution is instruction-driven
- Memory is demand-fetched
- Caches guess what you’ll need
🧠 Key point
The GPU is constantly deciding what to do next.
That decision logic costs:
- Power
- Area
- Latency
- Determinism
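A rough mental model of that loop, written as plain Python rather than real CUDA (tile size and the staging steps are illustrative, not any specific GPU's parameters):

```python
import numpy as np

def gpu_style_tiled_matmul(A, B, tile=64):
    """Conceptual sketch of instruction-driven, demand-fetched execution.

    Each loop level stands in for work the hardware re-decides at runtime:
    which tile to fetch from HBM, when to stage it closer to the compute
    units, and when to write partial results back.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)

    for i in range(0, M, tile):                  # block/warp scheduling decisions
        for j in range(0, N, tile):
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):
                # "Demand fetch": pull tiles HBM -> L2 -> shared memory
                a_tile = A[i:i + tile, k:k + tile]
                b_tile = B[k:k + tile, j:j + tile]
                # Tensor-Core-style tile multiply-accumulate
                acc += a_tile @ b_tile
            # Write partial results back, then synchronize before the next tile
            C[i:i + tile, j:j + tile] = acc
    return C
```

For random inputs this matches `A @ B`; the point is how much per-tile bookkeeping (fetch, stage, write back, synchronize) surrounds each block of useful math.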
2. TPU execution pipeline (why it’s more efficient)
A TPU has three main blocks
- MXU – Matrix Unit (systolic array)
- VPU – Vector Processing Unit (elementwise and other non-matmul ops)
- SRAM – large on-chip scratchpads
What happens during a matmul on a TPU
- Data is preloaded into SRAM
- A and B values stream into the MXU
- Multiply-accumulate happens every cycle
- Partial sums stay local
- Results stream out — no cache, no stalls
No instruction fetch.
No dynamic scheduling.
No cache misses.
🧠 Key point
The TPU doesn’t decide — it just flows.
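A toy simulation of that dataflow, assuming a weight-stationary arrangement (one matrix preloaded from SRAM, the other streamed through, partial sums kept local). It is a conceptual sketch of the flow, not how the MXU is actually wired:

```python
import numpy as np

def mxu_style_matmul(A, B):
    """Toy model of TPU-style dataflow: no fetch/decode per operation.

    B plays the role of weights preloaded into the array, rows of A are
    streamed in, every "cycle" does a multiply-accumulate wavefront, and
    only finished results stream out.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2

    weights = B.copy()                      # preloaded once, stays resident
    out = np.zeros((M, N), dtype=A.dtype)

    for m in range(M):                      # stream one input row at a time
        acc = np.zeros(N, dtype=A.dtype)    # partial sums stay local
        for k in range(K):                  # one MAC wavefront per "cycle"
            acc += A[m, k] * weights[k, :]
        out[m, :] = acc                     # results stream out at the end
    return out
```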
3. Systolic array vs Tensor Core (the heart of the difference)
Tensor Core (GPU)
- Executes one MMA instruction at a time (one small tile per instruction)
- Needs operands staged in registers + shared memory
- Controlled by warps
- Fed by caches
Systolic array (TPU)
- Thousands of MACs wired together
- Data pulses through like a circuit
- Every cycle does useful work
- No control overhead per operation
📌 Why TPUs win
- Higher utilization
- Lower energy per MAC
- No instruction overhead per multiply
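A back-of-the-envelope way to see the control-overhead gap. The sizes below are assumptions for the sketch (a 16×16×16 tile is in the ballpark of a warp-level MMA operation, and 128×128 is the commonly cited MXU dimension), not vendor specifications:

```python
# Illustrative "useful MACs per control decision" comparison.
# Sizes are assumptions for the sketch, not vendor specs.

wmma_tile = (16, 16, 16)   # one warp-level GPU MMA op covers a small tile
macs_per_gpu_mma = wmma_tile[0] * wmma_tile[1] * wmma_tile[2]   # 4096

mxu_dim = 128              # assumed systolic array of 128 x 128 MAC cells
macs_per_tpu_cycle = mxu_dim * mxu_dim                          # 16384

print(f"GPU: {macs_per_gpu_mma} MACs per issued MMA instruction "
      f"(each one fetched, decoded, scheduled, fed from registers)")
print(f"TPU: {macs_per_tpu_cycle} MACs per cycle, "
      f"with no instruction issued per multiply")
```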
4. Compiler role: XLA vs CUDA
TPU + XLA
- Model is fully compiled
- Memory addresses are fixed
- Execution timing is fixed
- Data movement is explicit
GPU + CUDA
- Kernels launched dynamically
- Memory accessed at runtime
- Scheduling is reactive
- Caches hide uncertainty
🧠 Rule of thumb
Compilation beats speculation when the workload is predictable.
Transformers are predictable.
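A minimal sketch of the "fully compiled" side using JAX, which lowers to XLA (this is the public ahead-of-time API, not Google's internal tooling; shapes are arbitrary):

```python
import jax
import jax.numpy as jnp

def layer(x, w):
    # A fixed-shape dense layer: exactly the kind of graph XLA likes.
    return jax.nn.relu(x @ w)

x = jnp.ones((128, 512))
w = jnp.ones((512, 512))

# Trace and compile the whole function once, for these exact shapes.
compiled = jax.jit(layer).lower(x, w).compile()

# Every later call reuses the same fixed executable: buffers and schedule
# were decided at compile time, not kernel by kernel at runtime.
y = compiled(x, w)
print(y.shape)   # (128, 512)
```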
5. Why TPUs dominate training throughput
Training characteristics:
- Huge batch sizes
- Dense matrix multiplies
- Repeated many times
TPUs shine because:
- Systolic arrays stay 90%+ utilized
- SRAM reuse is extremely high
- Power per FLOP is low
- The torus interconnect maps well onto the all-reduce traffic of dense layers
GPUs:
- Can match peak FLOPs
- Rarely match sustained FLOPs
- Burn more power on control + memory
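One way to make "peak vs sustained" concrete is a model-FLOPs-utilization style calculation. Every number below is a placeholder chosen to show the formula, not a measurement of any real chip:

```python
# Rough utilization math: achieved FLOP/s vs peak FLOP/s.

batch, d_model, d_ff = 1024, 4096, 16384
# FLOPs for one dense layer forward pass: 2 * batch * d_model * d_ff
flops_per_layer = 2 * batch * d_model * d_ff

step_time_s = 0.004      # wall-clock time for this layer (placeholder)
peak_flops = 200e12      # accelerator peak FLOP/s (placeholder)

achieved = flops_per_layer / step_time_s
utilization = achieved / peak_flops
print(f"achieved: {achieved / 1e12:.1f} TFLOP/s, utilization: {utilization:.1%}")
```

Sustained utilization is what the "90%+ utilized" vs "rarely match sustained" contrast is about: the same peak number can hide very different achieved numbers.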
6. Why GPUs still matter (and where TPUs lose)
GPUs win when:
- Models are dynamic
- Control flow is irregular
- Ops are custom
- Batch sizes are small
- Latency matters more than throughput
TPUs struggle when:
- Shapes change often
- Routing is dynamic (MoE)
- Workload is sparse
- Short jobs dominate (compile cost)
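The first and last points are easy to reproduce with JAX: each new input shape triggers a fresh trace and XLA compile, which is exactly the cost that hurts short, shape-changing jobs. A minimal sketch (timings vary by backend):

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    return (x @ x.T).sum()

for n in (128, 256, 512):             # a "shapes change often" workload
    x = jnp.ones((n, n))
    t0 = time.perf_counter()
    f(x).block_until_ready()          # first call at this shape: trace + compile
    t1 = time.perf_counter()
    f(x).block_until_ready()          # second call: cached executable
    t2 = time.perf_counter()
    print(f"n={n}: first call {t1 - t0:.3f}s (compiles), repeat {t2 - t1:.5f}s")
```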
This is why:
- Google uses TPUs internally
- But still offers GPUs broadly
- And Nvidia still dominates the ecosystem
7. The one diagram that explains everything
GPU philosophy: decide at runtime. Fetch instructions, schedule warps, guess with caches, react.
TPU philosophy: decide at compile time. Preload, stream, flow.
Final takeaway (very tight)
TPUs do better than GPUs because they remove decision-making from the inner loop.
When the math is known in advance (transformers, dense training), dataflow beats instruction-driven execution every time — in speed, power, and scale.
That’s the entire story.
Still to come:
- Why this advantage shrinks in inference
- Why MoE breaks TPU elegance
- Why Nvidia didn't copy systolic arrays
- Why TPUs are hard to sell outside Google (updates here)