Monday, March 02, 2026

TPU v5 deep dive

What They Understand Well

1) Memory hierarchy intuition (HBM → VMEM → registers)

That mental model is good.

On modern TPUs (including v5e), the simplified flow is:

HBM (off-chip DRAM) → On-chip SRAM (VMEM) → Vector / scalar registers → Compute units (VPU/MXU)

The kitchen analogy works conceptually.

What’s especially good:

  • They understand every operation must eventually happen in registers.

  • They understand data movement dominates performance.

  • They recognize tiling is required because VMEM and registers are limited.

That’s foundational systems thinking.


2) Parallelism intuition (8×128 tiles, 1024 elements)

Correct spirit.

The VPU processes large vector widths (implementation details vary slightly by generation), but the key idea is:

TPUs are throughput machines. They operate on wide tiles, not scalars.

That mental shift — from scalar programming to tile programming — is huge.
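A quick NumPy sketch of that shift (NumPy standing in for the TPU here — the shapes mirror the VPU's 8×128 tile, but this is an illustration, not TPU code):

```python
import numpy as np

# Scalar mindset: one element at a time — fights a throughput machine.
def relu_scalar(x):
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = max(x[i, j], 0.0)
    return out

# Tile mindset: express the op over a whole (8, 128) tile at once,
# matching the hardware's native vector shape.
def relu_tile(x):
    return np.maximum(x, 0.0)

tile = np.random.randn(8, 128).astype(np.float32)  # one VPU-shaped tile
assert np.allclose(relu_scalar(tile), relu_tile(tile))
```

Same result, but only the second version maps onto wide vector lanes.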


3) VPU vs MXU execution model

This is a strong observation:

  • VPU → synchronous, elementwise ops

  • MXU → asynchronous matrix multiply pipeline

The important concept they caught:

You can overlap scalar/vector work with MXU matmuls.

That’s essentially understanding latency hiding via pipeline parallelism.

That is exactly how high-performance kernels are written.

Very good.


4) BlockSpec + grid explanation

This explanation is clean and mostly accurate:

  • block_shape → tile size

  • index_map → mapping grid coords to array slice

  • grid → launch dimensions

And yes:

grid_index × block_shape = offset into tensor

That’s correct at a conceptual level.
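The `grid_index × block_shape = offset` rule can be sketched in plain Python (a pure-NumPy model of the default index map, not actual Pallas):

```python
import numpy as np

def block_slice(grid_index, block_shape):
    """Map a grid coordinate to the array slice it owns:
    offset = grid_index * block_shape along each dimension."""
    return tuple(
        slice(g * b, (g + 1) * b) for g, b in zip(grid_index, block_shape)
    )

x = np.arange(16 * 32).reshape(16, 32)
block_shape = (8, 16)          # tile size
grid = (16 // 8, 32 // 16)     # launch dimensions: (2, 2)

# Grid point (1, 0) owns rows 8..15, cols 0..15.
tile = x[block_slice((1, 0), block_shape)]
assert tile.shape == (8, 16)
assert tile[0, 0] == x[8, 0]
```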


5) Softmax tiling insight

This is 100% correct:

Never tile across the reduction dimension unless you know how to handle partial reductions.

Softmax reduces across columns (usually). So you:

  • Tile rows

  • Keep full reduction dimension intact

This is a core GPU/TPU rule too.
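A minimal NumPy sketch of the rule — tile rows, keep the reduction (column) dimension whole so each tile computes its own max and sum:

```python
import numpy as np

def softmax_row_tiled(x, row_tile=8):
    """Softmax over the last axis, tiling only across rows.
    Each tile carries its full reduction dimension, so per-row max
    and sum are computed locally — no partial reductions needed."""
    out = np.empty_like(x)
    for r in range(0, x.shape[0], row_tile):
        tile = x[r:r + row_tile]                 # (row_tile, N): full rows
        m = tile.max(axis=-1, keepdims=True)     # per-row max for stability
        e = np.exp(tile - m)
        out[r:r + row_tile] = e / e.sum(axis=-1, keepdims=True)
    return out

x = np.random.randn(32, 128).astype(np.float32)
ref = np.exp(x - x.max(-1, keepdims=True))
ref /= ref.sum(-1, keepdims=True)
assert np.allclose(softmax_row_tiled(x), ref, atol=1e-6)
```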


6) BF16 inputs, FP32 accumulators

Correct and important.

BF16:

  • 8 exponent bits (like FP32)

  • 7 mantissa bits

  • Same dynamic range, lower precision

Accumulating in BF16 would:

  • Introduce large rounding error

  • Cause catastrophic loss in reductions

All high-performance kernels accumulate in FP32.

This shows strong numerical awareness.
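You can watch the failure directly. NumPy has no bfloat16, so float16 stands in as the low-precision type here; the failure mode — rounding error growing with reduction length — is the same:

```python
import numpy as np

x = np.full(100_000, 0.1, dtype=np.float16)

# Accumulating in the low-precision type: once the running sum is large,
# adding 0.1 rounds to nothing and the total stalls far below the truth.
lo_acc = np.float16(0.0)
for v in x:
    lo_acc = np.float16(lo_acc + v)

# Accumulating in FP32: inputs stay low-precision, only the accumulator is wide.
hi_acc = x.astype(np.float32).sum()

print(float(lo_acc), float(hi_acc))  # low-precision sum stalls; FP32 lands near 10000
assert abs(float(hi_acc) - 10000.0) < 50.0
assert float(lo_acc) < 1000.0        # catastrophically wrong
```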


🔧 What Could Be Clarified or Made More General

Now let’s refine some details.


1) “Every operation follows HBM → VMEM → VREG”

Conceptually true — but performance engineers think differently:

The real goal is:

Move data from HBM as few times as possible.

Good kernels:

  • Load once from HBM

  • Reuse heavily in VMEM / registers

  • Write back once

So a more precise framing:

Instead of:

Every op follows that path

Better:

All data originates in HBM, but high-performance kernels maximize reuse in on-chip memory to avoid going back.

The difference matters.


2) “MXU runs asynchronously”

Yes — but nuance:

It’s not free parallelism.

It works because:

  • The scalar unit issues MXU instructions.

  • The MXU has its own deep pipeline.

  • The compiler/runtime schedules around it.

The overlap works only if:

  • Dependencies allow it.

  • You structure the kernel to expose independent work.

So the deeper insight:

Asynchrony only helps if you create independent instruction streams.

Otherwise you just stall.


3) Softmax tiling rule — generalize it

Instead of:

Never tile along a dimension you're reducing over.

More general:

You can tile reduction dimensions — but then you must:

  • Compute partial reductions

  • Store partial sums

  • Perform a second reduction pass

So the full rule is:

Tiling across reduction dims increases algorithmic complexity and synchronization cost.

They learned the simplified case. The advanced version is worth adding.
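A NumPy sketch of the two-pass pattern for a plain sum (softmax additionally needs a running max, but the structure is the same):

```python
import numpy as np

def tiled_sum(x, tile=16):
    """Sum over the last axis while tiling *across* the reduction dim:
    pass 1 computes a partial sum per tile, pass 2 reduces the partials."""
    n = x.shape[-1]
    assert n % tile == 0
    # Pass 1: one partial sum per tile of the reduction dimension.
    partials = x.reshape(*x.shape[:-1], n // tile, tile).sum(axis=-1)
    # Pass 2: reduce the stored partials.
    return partials.sum(axis=-1)

x = np.random.randn(8, 128)
assert np.allclose(tiled_sum(x), x.sum(axis=-1))
```

Two passes, extra storage for partials — that is the "algorithmic complexity and synchronization cost" in concrete form.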


4) Missing concept: Bandwidth vs Compute Bound

They’re close to a big “aha” but didn’t state it:

There are two kernel regimes:

Compute-bound:

  • MXU saturated

  • Performance limited by FLOPs

Memory-bound:

  • HBM bandwidth limited

  • Performance limited by bytes/sec

Softmax is often memory-bound.
MatMul is compute-bound (if sized well).

This distinction is foundational in inference engineering.
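The regime check is just a back-of-envelope division. The peak numbers below are hypothetical placeholders, not real chip specs; the FLOP/byte counts are the standard approximations:

```python
# A kernel is memory-bound when streaming its bytes takes longer
# than doing its FLOPs. Peak figures are illustrative, not a real chip.
PEAK_FLOPS = 200e12   # hypothetical peak FLOP/s
PEAK_BW = 800e9       # hypothetical HBM bytes/s

def regime(flops, bytes_moved):
    t_compute = flops / PEAK_FLOPS
    t_memory = bytes_moved / PEAK_BW
    return "compute-bound" if t_compute > t_memory else "memory-bound"

n = 4096
matmul_flops = 2 * n**3          # n×n matmul: 2n^3 FLOPs
matmul_bytes = 3 * n * n * 2     # read A, B, write C in bf16
softmax_flops = 5 * n * n        # a handful of ops per element
softmax_bytes = 2 * n * n * 2    # read + write in bf16

print(regime(matmul_flops, matmul_bytes))    # compute-bound
print(regime(softmax_flops, softmax_bytes))  # memory-bound
```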


🧠 Extra “Ahas” They’re Ready For

These would level them up.


🔥 Aha 1: Tiling Is About Reuse, Not Fitting

Many beginners think:

We tile because memory is small.

Real reason:

We tile to maximize arithmetic intensity.

Arithmetic intensity = FLOPs / bytes moved.

MatMul works because:

  • Each loaded value participates in many multiply-adds.

  • So compute dominates memory cost.

Softmax has low reuse → often bandwidth bound.

That mental shift is huge.
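Putting numbers on it (FP32 byte counts, standard FLOP approximations — illustrative, not measured):

```python
def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved  # FLOPs per byte moved from HBM

n = 1024
# n×n matmul: each loaded value joins ~n multiply-adds → intensity grows with n.
matmul_ai = arithmetic_intensity(2 * n**3, 3 * n * n * 4)      # fp32 in/out
# Elementwise op: each loaded value is used once → a small constant.
elementwise_ai = arithmetic_intensity(1 * n * n, 2 * n * n * 4)

print(round(matmul_ai, 1))       # 170.7 FLOPs/byte — reuse-rich
print(round(elementwise_ai, 3))  # 0.125 FLOPs/byte — bandwidth bound
```

Bigger tiles raise matmul's intensity further; no amount of tiling changes the elementwise number.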


🔥 Aha 2: MXU Is a Systolic Array

The MXU isn’t just a fast matmul unit.

It’s a systolic array:

  • Data flows rhythmically across a 2D grid.

  • Multiply-accumulate units arranged spatially.

  • Partial sums propagate across the array.

That explains:

  • Why tile shapes matter.

  • Why alignment matters.

  • Why padding matters.
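Padding in practice: shapes get rounded up to the hardware tile so they map cleanly onto the array. A NumPy sketch (the (8, 128) tile shape is illustrative):

```python
import numpy as np

def pad_to_tile(x, tile=(8, 128)):
    """Pad a 2-D array up to the next multiple of the hardware tile shape.
    (-d) % t is the gap from dimension d to the next multiple of t."""
    pads = [(0, (-d) % t) for d, t in zip(x.shape, tile)]
    return np.pad(x, pads)

x = np.ones((13, 200), dtype=np.float32)
y = pad_to_tile(x)
assert y.shape == (16, 256)   # rounded up to multiples of (8, 128)
```

The padded fraction is wasted work — one reason oddly-shaped matmuls underperform.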


🔥 Aha 3: Kernel performance is scheduling + memory choreography

The best TPU kernels:

  • Double buffer VMEM

  • Prefetch next tile while computing current tile

  • Overlap MXU with VPU

  • Avoid bank conflicts

  • Align shapes to hardware tile sizes

The game is not just math.

It’s choreography.


🔥 Aha 4: Inference vs Training

If they’re reading inference engineering:

Key difference:

  • Training → needs backward pass + activation storage

  • Inference → cares about latency + throughput

Softmax in inference:

  • Often fused

  • Sometimes avoided entirely (logits used directly)

Kernel design differs.
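One concrete case of "avoided entirely": softmax is monotone per row, so greedy decoding can take the argmax of raw logits directly. A NumPy check:

```python
import numpy as np

logits = np.random.randn(4, 50_000).astype(np.float32)  # a batch of vocab logits

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Softmax preserves ordering within each row, so the argmax is identical —
# greedy decoding never needs the normalization at all.
assert np.array_equal(np.argmax(logits, -1), np.argmax(softmax(logits), -1))
```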
