Monday, March 02, 2026

TPU v5 deep dive

What They Understand Well

1) Memory hierarchy intuition (HBM → VMEM → registers)

That mental model is good.

On modern TPUs (including v5e), the simplified flow is:

HBM (off-chip DRAM) → On-chip SRAM (VMEM) → Vector / scalar registers → Compute units (VPU/MXU)

The kitchen analogy works conceptually.

What’s especially good:

  • They understand every operation must eventually happen in registers.

  • They understand data movement dominates performance.

  • They recognize tiling is required because VMEM and registers are limited.

That’s foundational systems thinking.


2) Parallelism intuition (8×128 tiles, 1024 elements)

Correct spirit.

The VPU processes large vector widths (implementation details vary slightly by generation), but the key idea is:

TPUs are throughput machines. They operate on wide tiles, not scalars.

That mental shift — from scalar programming to tile programming — is huge.
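A quick NumPy sketch of that shift (NumPy standing in for the TPU here — the shapes mirror the VPU's 8×128 tile, but this is an illustration, not TPU code):

```python
import numpy as np

# Scalar mindset: one element at a time — fights a throughput machine.
def relu_scalar(x):
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = max(x[i, j], 0.0)
    return out

# Tile mindset: express the op over a whole (8, 128) tile at once,
# matching the hardware's native vector shape.
def relu_tile(x):
    return np.maximum(x, 0.0)

tile = np.random.randn(8, 128).astype(np.float32)  # one VPU-shaped tile
assert np.allclose(relu_scalar(tile), relu_tile(tile))
```

Same result, but only the second version maps onto wide vector lanes.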


3) VPU vs MXU execution model

This is a strong observation:

  • VPU → synchronous, elementwise ops

  • MXU → asynchronous matrix multiply pipeline

The important concept they caught:

You can overlap scalar/vector work with MXU matmuls.

That’s essentially understanding latency hiding via pipeline parallelism.

That is exactly how high-performance kernels are written.

Very good.


4) BlockSpec + grid explanation

This explanation is clean and mostly accurate:

  • block_shape → tile size

  • index_map → mapping grid coords to array slice

  • grid → launch dimensions

And yes:

grid_index × block_shape = offset into tensor

That’s correct at a conceptual level.
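The `grid_index × block_shape = offset` rule can be sketched in plain Python (a pure-NumPy model of the default index map, not actual Pallas):

```python
import numpy as np

def block_slice(grid_index, block_shape):
    """Map a grid coordinate to the array slice it owns:
    offset = grid_index * block_shape along each dimension."""
    return tuple(
        slice(g * b, (g + 1) * b) for g, b in zip(grid_index, block_shape)
    )

x = np.arange(16 * 32).reshape(16, 32)
block_shape = (8, 16)          # tile size
grid = (16 // 8, 32 // 16)     # launch dimensions: (2, 2)

# Grid point (1, 0) owns rows 8..15, cols 0..15.
tile = x[block_slice((1, 0), block_shape)]
assert tile.shape == (8, 16)
assert tile[0, 0] == x[8, 0]
```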


5) Softmax tiling insight

This is 100% correct:

Never tile across the reduction dimension unless you know how to handle partial reductions.

Softmax reduces across columns (usually). So you:

  • Tile rows

  • Keep full reduction dimension intact

This is a core GPU/TPU rule too.
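A minimal NumPy sketch of the rule — tile rows, keep the reduction (column) dimension whole so each tile computes its own max and sum:

```python
import numpy as np

def softmax_row_tiled(x, row_tile=8):
    """Softmax over the last axis, tiling only across rows.
    Each tile carries its full reduction dimension, so per-row max
    and sum are computed locally — no partial reductions needed."""
    out = np.empty_like(x)
    for r in range(0, x.shape[0], row_tile):
        tile = x[r:r + row_tile]                 # (row_tile, N): full rows
        m = tile.max(axis=-1, keepdims=True)     # per-row max for stability
        e = np.exp(tile - m)
        out[r:r + row_tile] = e / e.sum(axis=-1, keepdims=True)
    return out

x = np.random.randn(32, 128).astype(np.float32)
ref = np.exp(x - x.max(-1, keepdims=True))
ref /= ref.sum(-1, keepdims=True)
assert np.allclose(softmax_row_tiled(x), ref, atol=1e-6)
```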


6) BF16 inputs, FP32 accumulators

Correct and important.

BF16:

  • 8 exponent bits (like FP32)

  • 7 mantissa bits

  • Same dynamic range, lower precision

Accumulating in BF16 would:

  • Introduce large rounding error

  • Cause catastrophic loss in reductions

All high-performance kernels accumulate in FP32.

This shows strong numerical awareness.
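You can watch the failure directly. NumPy has no bfloat16, so float16 stands in as the low-precision type here; the failure mode — rounding error growing with reduction length — is the same:

```python
import numpy as np

x = np.full(100_000, 0.1, dtype=np.float16)

# Accumulating in the low-precision type: once the running sum is large,
# adding 0.1 rounds to nothing and the total stalls far below the truth.
lo_acc = np.float16(0.0)
for v in x:
    lo_acc = np.float16(lo_acc + v)

# Accumulating in FP32: inputs stay low-precision, only the accumulator is wide.
hi_acc = x.astype(np.float32).sum()

print(float(lo_acc), float(hi_acc))  # low-precision sum stalls; FP32 lands near 10000
assert abs(float(hi_acc) - 10000.0) < 50.0
assert float(lo_acc) < 1000.0        # catastrophically wrong
```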


🔧 What Could Be Clarified or Made More General

Now let’s refine some details.


1) “Every operation follows HBM → VMEM → VREG”

Conceptually true — but performance engineers think differently:

The real goal is:

Move data from HBM as few times as possible.

Good kernels:

  • Load once from HBM

  • Reuse heavily in VMEM / registers

  • Write back once

So a more precise framing:

Instead of:

Every op follows that path

Better:

All data originates in HBM, but high-performance kernels maximize reuse in on-chip memory to avoid going back.

The difference matters.


2) “MXU runs asynchronously”

Yes — but nuance:

It’s not free parallelism.

It works because:

  • The scalar unit issues MXU instructions.

  • The MXU has its own deep pipeline.

  • The compiler/runtime schedules around it.

The overlap works only if:

  • Dependencies allow it.

  • You structure the kernel to expose independent work.

So the deeper insight:

Asynchrony only helps if you create independent instruction streams.

Otherwise you just stall.


3) Softmax tiling rule — generalize it

Instead of:

Never tile along a dimension you're reducing over.

More general:

You can tile reduction dimensions — but then you must:

  • Compute partial reductions

  • Store partial sums

  • Perform a second reduction pass

So the full rule is:

Tiling across reduction dims increases algorithmic complexity and synchronization cost.

They learned the simplified case. The advanced version is worth adding.
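A NumPy sketch of the two-pass pattern for a plain sum (softmax additionally needs a running max, but the structure is the same):

```python
import numpy as np

def tiled_sum(x, tile=16):
    """Sum over the last axis while tiling *across* the reduction dim:
    pass 1 computes a partial sum per tile, pass 2 reduces the partials."""
    n = x.shape[-1]
    assert n % tile == 0
    # Pass 1: one partial sum per tile of the reduction dimension.
    partials = x.reshape(*x.shape[:-1], n // tile, tile).sum(axis=-1)
    # Pass 2: reduce the stored partials.
    return partials.sum(axis=-1)

x = np.random.randn(8, 128)
assert np.allclose(tiled_sum(x), x.sum(axis=-1))
```

Two passes, extra storage for partials — that is the "algorithmic complexity and synchronization cost" in concrete form.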


4) Missing concept: Bandwidth vs Compute Bound

They’re close to a big “aha” but didn’t state it:

There are two kernel regimes:

Compute-bound:

  • MXU saturated

  • Performance limited by FLOPs

Memory-bound:

  • HBM bandwidth limited

  • Performance limited by bytes/sec

Softmax is often memory-bound.
MatMul is compute-bound (if sized well).

This distinction is foundational in inference engineering.
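The regime check is just a back-of-envelope division. The peak numbers below are hypothetical placeholders, not real chip specs; the FLOP/byte counts are the standard approximations:

```python
# A kernel is memory-bound when streaming its bytes takes longer
# than doing its FLOPs. Peak figures are illustrative, not a real chip.
PEAK_FLOPS = 200e12   # hypothetical peak FLOP/s
PEAK_BW = 800e9       # hypothetical HBM bytes/s

def regime(flops, bytes_moved):
    t_compute = flops / PEAK_FLOPS
    t_memory = bytes_moved / PEAK_BW
    return "compute-bound" if t_compute > t_memory else "memory-bound"

n = 4096
matmul_flops = 2 * n**3          # n×n matmul: 2n^3 FLOPs
matmul_bytes = 3 * n * n * 2     # read A, B, write C in bf16
softmax_flops = 5 * n * n        # a handful of ops per element
softmax_bytes = 2 * n * n * 2    # read + write in bf16

print(regime(matmul_flops, matmul_bytes))    # compute-bound
print(regime(softmax_flops, softmax_bytes))  # memory-bound
```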


🧠 Extra “Ahas” They’re Ready For

These would level them up.


🔥 Aha 1: Tiling Is About Reuse, Not Fitting

Many beginners think:

We tile because memory is small.

Real reason:

We tile to maximize arithmetic intensity.

Arithmetic intensity = FLOPs / bytes moved.

MatMul works because:

  • Each loaded value participates in many multiply-adds.

  • So compute dominates memory cost.

Softmax has low reuse → often bandwidth bound.

That mental shift is huge.
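Putting numbers on it (FP32 byte counts, standard FLOP approximations — illustrative, not measured):

```python
def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved  # FLOPs per byte moved from HBM

n = 1024
# n×n matmul: each loaded value joins ~n multiply-adds → intensity grows with n.
matmul_ai = arithmetic_intensity(2 * n**3, 3 * n * n * 4)      # fp32 in/out
# Elementwise op: each loaded value is used once → a small constant.
elementwise_ai = arithmetic_intensity(1 * n * n, 2 * n * n * 4)

print(round(matmul_ai, 1))       # 170.7 FLOPs/byte — reuse-rich
print(round(elementwise_ai, 3))  # 0.125 FLOPs/byte — bandwidth bound
```

Bigger tiles raise matmul's intensity further; no amount of tiling changes the elementwise number.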


🔥 Aha 2: MXU Is a Systolic Array

The MXU isn’t just a fast matmul unit.

It’s a systolic array:

  • Data flows rhythmically across a 2D grid.

  • Multiply-accumulate units arranged spatially.

  • Partial sums propagate across the array.

That explains:

  • Why tile shapes matter.

  • Why alignment matters.

  • Why padding matters.
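Padding in practice: shapes get rounded up to the hardware tile so they map cleanly onto the array. A NumPy sketch (the (8, 128) tile shape is illustrative):

```python
import numpy as np

def pad_to_tile(x, tile=(8, 128)):
    """Pad a 2-D array up to the next multiple of the hardware tile shape.
    (-d) % t is the gap from dimension d to the next multiple of t."""
    pads = [(0, (-d) % t) for d, t in zip(x.shape, tile)]
    return np.pad(x, pads)

x = np.ones((13, 200), dtype=np.float32)
y = pad_to_tile(x)
assert y.shape == (16, 256)   # rounded up to multiples of (8, 128)
```

The padded fraction is wasted work — one reason oddly-shaped matmuls underperform.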


🔥 Aha 3: Kernel performance is scheduling + memory choreography

The best TPU kernels:

  • Double buffer VMEM

  • Prefetch next tile while computing current tile

  • Overlap MXU with VPU

  • Avoid bank conflicts

  • Align shapes to hardware tile sizes

The game is not just math.

It’s choreography.


🔥 Aha 4: Inference vs Training

If they’re reading inference engineering:

Key difference:

  • Training → needs backward pass + activation storage

  • Inference → cares about latency + throughput

Softmax in inference:

  • Often fused

  • Sometimes avoided entirely (logits used directly)

Kernel design differs.
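One concrete case of "avoided entirely": softmax is monotone per row, so greedy decoding can take the argmax of raw logits directly. A NumPy check:

```python
import numpy as np

logits = np.random.randn(4, 50_000).astype(np.float32)  # a batch of vocab logits

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Softmax preserves ordering within each row, so the argmax is identical —
# greedy decoding never needs the normalization at all.
assert np.array_equal(np.argmax(logits, -1), np.argmax(softmax(logits), -1))
```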
