1. GPU execution pipeline (why flexibility costs efficiency)
What a GPU is optimizing for
- Many different programs
- Many different access patterns
- Many different users
What actually happens during a matmul on a GPU
- Instructions are fetched and decoded
- Warps are scheduled dynamically
- Data is pulled from HBM → L2 → shared memory
- Tensor Cores execute tiles
- Partial results are written back
- Synchronization barriers occur
- Repeat per kernel launch
Even with Tensor Cores:
- Execution is instruction-driven
- Memory is demand-fetched
- Caches guess what you’ll need
🧠 Key point
The GPU is constantly deciding what to do next.
That decision logic costs:
- Power
- Area
- Latency
- Determinism
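A rough mental model of that loop, written as plain Python rather than real CUDA (tile size and the staging steps are illustrative, not any specific GPU's parameters):

```python
import numpy as np

def gpu_style_tiled_matmul(A, B, tile=64):
    """Conceptual sketch of instruction-driven, demand-fetched execution.

    Each loop level stands in for work the hardware re-decides at runtime:
    which tile to fetch from HBM, when to stage it closer to the compute
    units, and when to write partial results back.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)

    for i in range(0, M, tile):                  # block/warp scheduling decisions
        for j in range(0, N, tile):
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):
                # "Demand fetch": pull tiles HBM -> L2 -> shared memory
                a_tile = A[i:i + tile, k:k + tile]
                b_tile = B[k:k + tile, j:j + tile]
                # Tensor-Core-style tile multiply-accumulate
                acc += a_tile @ b_tile
            # Write partial results back, then synchronize before the next tile
            C[i:i + tile, j:j + tile] = acc
    return C
```

For random inputs this matches `A @ B`; the point is how much per-tile bookkeeping (fetch, stage, write back, synchronize) surrounds each block of useful math.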
2. TPU execution pipeline (why it’s more efficient)
A TPU has three main blocks
- MXU – Matrix Unit (systolic array)
- VPU – Vector Processing Unit (elementwise and other non-matmul ops)
- SRAM – large on-chip scratchpads
What happens during a matmul on a TPU
- Data is preloaded into SRAM
- A and B values stream into the MXU
- Multiply-accumulate happens every cycle
- Partial sums stay local
- Results stream out — no cache, no stalls
No instruction fetch.
No dynamic scheduling.
No cache misses.
🧠 Key point
The TPU doesn’t decide — it just flows.
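A toy simulation of that dataflow, assuming a weight-stationary arrangement (one matrix preloaded from SRAM, the other streamed through, partial sums kept local). It is a conceptual sketch of the flow, not how the MXU is actually wired:

```python
import numpy as np

def mxu_style_matmul(A, B):
    """Toy model of TPU-style dataflow: no fetch/decode per operation.

    B plays the role of weights preloaded into the array, rows of A are
    streamed in, every "cycle" does a multiply-accumulate wavefront, and
    only finished results stream out.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2

    weights = B.copy()                      # preloaded once, stays resident
    out = np.zeros((M, N), dtype=A.dtype)

    for m in range(M):                      # stream one input row at a time
        acc = np.zeros(N, dtype=A.dtype)    # partial sums stay local
        for k in range(K):                  # one MAC wavefront per "cycle"
            acc += A[m, k] * weights[k, :]
        out[m, :] = acc                     # results stream out at the end
    return out
```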
3. Systolic array vs Tensor Core (the heart of the difference)
Tensor Core (GPU)
- Executes one MMA instruction at a time (one small tile per instruction)
- Needs operands staged in registers + shared memory
- Controlled by warps
- Fed by caches
Systolic array (TPU)
- Thousands of MACs wired together
- Data pulses through like a circuit
- Every cycle does useful work
- No control overhead per operation
📌 Why TPUs win
- Higher utilization
- Lower energy per MAC
- No instruction overhead per multiply
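A back-of-the-envelope way to see the control-overhead gap. The sizes below are assumptions for the sketch (a 16×16×16 tile is in the ballpark of a warp-level MMA operation, and 128×128 is the commonly cited MXU dimension), not vendor specifications:

```python
# Illustrative "useful MACs per control decision" comparison.
# Sizes are assumptions for the sketch, not vendor specs.

wmma_tile = (16, 16, 16)   # one warp-level GPU MMA op covers a small tile
macs_per_gpu_mma = wmma_tile[0] * wmma_tile[1] * wmma_tile[2]   # 4096

mxu_dim = 128              # assumed systolic array of 128 x 128 MAC cells
macs_per_tpu_cycle = mxu_dim * mxu_dim                          # 16384

print(f"GPU: {macs_per_gpu_mma} MACs per issued MMA instruction "
      f"(each one fetched, decoded, scheduled, fed from registers)")
print(f"TPU: {macs_per_tpu_cycle} MACs per cycle, "
      f"with no instruction issued per multiply")
```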
4. Compiler role: XLA vs CUDA
TPU + XLA
- Model is fully compiled
- Memory addresses are fixed
- Execution timing is fixed
- Data movement is explicit
GPU + CUDA
- Kernels launched dynamically
- Memory accessed at runtime
- Scheduling is reactive
- Caches hide uncertainty
🧠 Rule of thumb
Compilation beats speculation when the workload is predictable.
Transformers are predictable.
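A minimal sketch of the "fully compiled" side using JAX, which lowers to XLA (this is the public ahead-of-time API, not Google's internal tooling; shapes are arbitrary):

```python
import jax
import jax.numpy as jnp

def layer(x, w):
    # A fixed-shape dense layer: exactly the kind of graph XLA likes.
    return jax.nn.relu(x @ w)

x = jnp.ones((128, 512))
w = jnp.ones((512, 512))

# Trace and compile the whole function once, for these exact shapes.
compiled = jax.jit(layer).lower(x, w).compile()

# Every later call reuses the same fixed executable: buffers and schedule
# were decided at compile time, not kernel by kernel at runtime.
y = compiled(x, w)
print(y.shape)   # (128, 512)
```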
5. Why TPUs dominate training throughput
Training characteristics:
- Huge batch sizes
- Dense matrix multiplies
- Repeated many times
TPUs shine because:
- Systolic arrays stay 90%+ utilized
- SRAM reuse is extremely high
- Power per FLOP is low
- The torus interconnect maps well onto the all-reduce traffic of dense layers
GPUs:
- Can match peak FLOPs
- Rarely match sustained FLOPs
- Burn more power on control + memory
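One way to make "peak vs sustained" concrete is a model-FLOPs-utilization style calculation. Every number below is a placeholder chosen to show the formula, not a measurement of any real chip:

```python
# Rough utilization math: achieved FLOP/s vs peak FLOP/s.

batch, d_model, d_ff = 1024, 4096, 16384
# FLOPs for one dense layer forward pass: 2 * batch * d_model * d_ff
flops_per_layer = 2 * batch * d_model * d_ff

step_time_s = 0.004      # wall-clock time for this layer (placeholder)
peak_flops = 200e12      # accelerator peak FLOP/s (placeholder)

achieved = flops_per_layer / step_time_s
utilization = achieved / peak_flops
print(f"achieved: {achieved / 1e12:.1f} TFLOP/s, utilization: {utilization:.1%}")
```

Sustained utilization is what the "90%+ utilized" vs "rarely match sustained" contrast is about: the same peak number can hide very different achieved numbers.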
6. Why GPUs still matter (and where TPUs lose)
GPUs win when:
- Models are dynamic
- Control flow is irregular
- Ops are custom
- Batch sizes are small
- Latency matters more than throughput
TPUs struggle when:
- Shapes change often
- Routing is dynamic (MoE)
- Workload is sparse
- Short jobs dominate (compile cost)
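The first and last points are easy to reproduce with JAX: each new input shape triggers a fresh trace and XLA compile, which is exactly the cost that hurts short, shape-changing jobs. A minimal sketch (timings vary by backend):

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    return (x @ x.T).sum()

for n in (128, 256, 512):             # a "shapes change often" workload
    x = jnp.ones((n, n))
    t0 = time.perf_counter()
    f(x).block_until_ready()          # first call at this shape: trace + compile
    t1 = time.perf_counter()
    f(x).block_until_ready()          # second call: cached executable
    t2 = time.perf_counter()
    print(f"n={n}: first call {t1 - t0:.3f}s (compiles), repeat {t2 - t1:.5f}s")
```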
This is why:
- Google uses TPUs internally
- But still offers GPUs broadly
- And Nvidia still dominates the ecosystem
7. The one diagram that explains everything
GPU philosophy: decide at runtime. Fetch instructions, schedule warps, guess with caches, react.
TPU philosophy: decide at compile time. Preload, stream, flow.
Final takeaway (very tight)
TPUs do better than GPUs because they remove decision-making from the inner loop.
When the math is known in advance (transformers, dense training), dataflow beats instruction-driven execution every time — in speed, power, and scale.
That’s the entire story.
Still to come:
- Why this advantage shrinks in inference
- Why MoE breaks TPU elegance
- Why Nvidia didn't copy systolic arrays
- Why TPUs are hard to sell outside Google (updates here)