Monday, March 02, 2026

updates

 Decoded stream buffer

AI capex math

TPU v5 deep dive

What They Understand Well

1) Memory hierarchy intuition (HBM → VMEM → registers)

That mental model is good.

On modern TPUs (including v5e), the simplified flow is:

HBM (off-chip DRAM) → On-chip SRAM (VMEM) → Vector/scalar registers → Compute units (VPU/MXU)

The kitchen analogy works conceptually.

What’s especially good:

  • They understand every operation must eventually happen in registers.

  • They understand data movement dominates performance.

  • They recognize tiling is required because VMEM and registers are limited.

That’s foundational systems thinking.


2) Parallelism intuition (8×128 tiles, 1024 elements)

Correct spirit.

The VPU processes large vector widths (implementation details vary slightly by generation), but the key idea is:

TPUs are throughput machines. They operate on wide tiles, not scalars.

That mental shift — from scalar programming to tile programming — is huge.


3) VPU vs MXU execution model

This is a strong observation:

  • VPU → synchronous, elementwise ops

  • MXU → asynchronous matrix multiply pipeline

The important concept they caught:

You can overlap scalar/vector work with MXU matmuls.

That’s essentially understanding latency hiding via pipeline parallelism.

That is exactly how high-performance kernels are written.

Very good.


4) BlockSpec + grid explanation

This explanation is clean and mostly accurate:

  • block_shape → tile size

  • index_map → mapping grid coords to array slice

  • grid → launch dimensions

And yes:

grid_index × block_shape = offset into tensor

That’s correct at a conceptual level.
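A minimal sketch of that offset relation in plain Python. The names (block_shape, index_map) mirror the conceptual roles described above; this is illustrative, not the actual Pallas API.

```python
# Sketch of how a Pallas-style BlockSpec maps grid coordinates to
# array offsets. Plain Python stand-in, not real Pallas code.

block_shape = (8, 128)  # tile size: 8 rows x 128 cols

def index_map(i, j):
    # grid coords -> block indices; the runtime scales these
    # by block_shape to get element offsets
    return (i, j)

def tile_offset(grid_idx, block_shape):
    # grid_index * block_shape = offset into the full tensor
    return tuple(g * b for g, b in zip(grid_idx, block_shape))

# A (32, 512) tensor tiled into (8, 128) blocks gives a 4 x 4 grid.
grid = (32 // block_shape[0], 512 // block_shape[1])
print(grid)                                        # (4, 4)
print(tile_offset(index_map(2, 3), block_shape))   # (16, 384)
```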


5) Softmax tiling insight

This is 100% correct:

Never tile across the reduction dimension unless you know how to handle partial reductions.

Softmax reduces across columns (usually). So you:

  • Tile rows

  • Keep full reduction dimension intact

This is a core GPU/TPU rule too.
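A pure-Python sketch of the row-tiled pattern: tiles cover rows only, so every tile still sees the full reduction dimension and no partial-reduction bookkeeping is needed. Illustrative only, not a real kernel.

```python
import math

def softmax_row(row):
    # numerically stable softmax over one full row (the reduction dim)
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def tiled_softmax(x, row_tile=2):
    # tile over rows only; the reduction dimension stays intact,
    # so each tile can compute its softmax independently
    out = []
    for start in range(0, len(x), row_tile):
        for row in x[start:start + row_tile]:
            out.append(softmax_row(row))
    return out

x = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [5.0, 1.0, 1.0], [2.0, 2.0, 0.0]]
y = tiled_softmax(x)
print([round(sum(r), 4) for r in y])  # each row sums to 1.0
```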


6) BF16 inputs, FP32 accumulators

Correct and important.

BF16:

  • 8 exponent bits (like FP32)

  • 7 mantissa bits

  • Same dynamic range, lower precision

Accumulating in BF16 would:

  • Introduce large rounding error

  • Cause catastrophic loss in reductions

All high-performance kernels accumulate in FP32.

This shows strong numerical awareness.
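A pure-Python simulation of why FP32 accumulation matters, approximating bf16 by truncating a float32 to its top 16 bits (an assumption for illustration; real hardware rounds rather than truncates, but the stagnation effect is the same).

```python
import struct

def to_bf16(x):
    # simulate bf16: keep the top 16 bits of the float32 encoding
    # (sign + 8 exponent bits + 7 mantissa bits), zero the rest
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

n = 100_000
acc_bf16 = 0.0
acc_fp32 = 0.0
for _ in range(n):
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(0.1))  # round after every add
    acc_fp32 += to_bf16(0.1)                     # high-precision accumulator

print(acc_fp32)  # ~9961: close to the true sum of the bf16 inputs
print(acc_bf16)  # stalls early: once acc is large, +0.1 rounds away
```

Once the running sum is large enough that 0.1 falls below one bf16 ulp, every addition is lost entirely, which is exactly the "catastrophic loss in reductions" above.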


🔧 What Could Be Clarified or Made More General

Now let’s refine some details.


1) “Every operation follows HBM → VMEM → VREG”

Conceptually true — but performance engineers think differently:

The real goal is:

Move data from HBM as few times as possible.

Good kernels:

  • Load once from HBM

  • Reuse heavily in VMEM / registers

  • Write back once

So a more precise framing:

Instead of:

Every op follows that path

Better:

All data originates in HBM, but high-performance kernels maximize reuse in on-chip memory to avoid going back.

The difference matters.


2) “MXU runs asynchronously”

Yes — but nuance:

It’s not free parallelism.

It works because:

  • The scalar unit issues MXU instructions.

  • The MXU has its own deep pipeline.

  • The compiler/runtime schedules around it.

The overlap works only if:

  • Dependencies allow it.

  • You structure the kernel to expose independent work.

So the deeper insight:

Asynchrony only helps if you create independent instruction streams.

Otherwise you just stall.


3) Softmax tiling rule — generalize it

Instead of:

Never tile along a dimension you're reducing over.

More general:

You can tile reduction dimensions — but then you must:

  • Compute partial reductions

  • Store partial sums

  • Perform a second reduction pass

So the full rule is:

Tiling across reduction dims increases algorithmic complexity and synchronization cost.

They learned the simplified case. The advanced version is worth adding.
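The advanced version can be sketched with (max, sum-of-exp) partials that are merged by rescaling to the larger max, as in online softmax. Pure-Python illustration of the two-pass structure.

```python
import math

def partial_stats(tile):
    # pass 1: each tile produces a (max, sum-of-exp) partial
    m = max(tile)
    return m, sum(math.exp(x - m) for x in tile)

def combine(a, b):
    # pass 2: merge two partials by rescaling to the shared max
    (ma, sa), (mb, sb) = a, b
    m = max(ma, mb)
    return m, sa * math.exp(ma - m) + sb * math.exp(mb - m)

row = [0.5, 2.0, -1.0, 3.0, 1.5, 0.0, 2.5, -0.5]
tiles = [row[i:i + 4] for i in range(0, len(row), 4)]

m, s = partial_stats(tiles[0])
for t in tiles[1:]:
    m, s = combine((m, s), partial_stats(t))

# matches the untiled reduction
m_ref = max(row)
s_ref = sum(math.exp(x - m_ref) for x in row)
assert m == m_ref and abs(s - s_ref) < 1e-12
```

The extra state (partials) and the second pass are the "algorithmic complexity and synchronization cost" the rule refers to.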


4) Missing concept: Bandwidth vs Compute Bound

They’re close to a big “aha” but didn’t state it:

There are two kernel regimes:

Compute-bound

  • MXU saturated

  • Performance limited by FLOPs

Memory-bound

  • HBM bandwidth limited

  • Performance limited by bytes/sec

Softmax is often memory-bound.
MatMul is compute-bound (if sized well).

This distinction is foundational in inference engineering.


🧠 Extra “Ahas” They’re Ready For

These would level them up.


🔥 Aha 1: Tiling Is About Reuse, Not Fitting

Many beginners think:

We tile because memory is small.

Real reason:

We tile to maximize arithmetic intensity.

Arithmetic intensity = FLOPs / bytes moved.

MatMul works because:

  • Each loaded value participates in many multiply-adds.

  • So compute dominates memory cost.

Softmax has low reuse → often bandwidth bound.

That mental shift is huge.
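A back-of-envelope version of that claim, with illustrative FLOP and byte counts (not measured numbers):

```python
# Rough arithmetic intensity (FLOPs per byte moved from HBM) for a
# square bf16 matmul vs an elementwise row softmax.

def matmul_intensity(n, bytes_per_elem=2):
    flops = 2 * n ** 3                        # n^3 multiply-adds
    bytes_moved = 3 * n * n * bytes_per_elem  # read A, B; write C once
    return flops / bytes_moved

def softmax_intensity(bytes_per_elem=2):
    # roughly ~5 flops per element (max, sub, exp, sum, div),
    # each element read once and written once
    return 5 / (2 * bytes_per_elem)

print(matmul_intensity(4096))  # hundreds of FLOPs per byte
print(softmax_intensity())     # ~1 FLOP per byte
```

MatMul reuse grows with n; softmax reuse is a constant, which is why it tends to sit on the bandwidth roof.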


🔥 Aha 2: MXU Is a Systolic Array

The MXU isn’t just a fast matmul unit.

It’s a systolic array:

  • Data flows rhythmically across a 2D grid.

  • Multiply-accumulate units arranged spatially.

  • Partial sums propagate across the array.

That explains:

  • Why tile shapes matter.

  • Why alignment matters.

  • Why padding matters.


🔥 Aha 3: Kernel performance is scheduling + memory choreography

The best TPU kernels:

  • Double buffer VMEM

  • Prefetch next tile while computing current tile

  • Overlap MXU with VPU

  • Avoid bank conflicts

  • Align shapes to hardware tile sizes

The game is not just math.

It’s choreography.
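The prefetch-while-computing pattern can be sketched with a host-side thread standing in for the async copy engine. Illustrative only; real kernels use hardware DMA and double-buffered VMEM, not Python threads.

```python
import concurrent.futures
import time

def load_tile(i):
    time.sleep(0.01)   # stand-in for an HBM -> VMEM copy
    return list(range(i, i + 4))

def compute(tile):
    time.sleep(0.01)   # stand-in for MXU/VPU work
    return sum(tile)

def pipelined(n_tiles):
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        nxt = pool.submit(load_tile, 0)  # prefetch the first tile
        for i in range(n_tiles):
            tile = nxt.result()
            if i + 1 < n_tiles:
                nxt = pool.submit(load_tile, i + 1)  # prefetch next...
            results.append(compute(tile))            # ...while computing
    return results

print(pipelined(4))  # [6, 10, 14, 18]
```

The load of tile i+1 overlaps the compute on tile i, so the pipeline hides one of the two latencies per step.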


🔥 Aha 4: Inference vs Training

If they’re reading inference engineering:

Key difference:

  • Training → needs backward pass + activation storage

  • Inference → cares about latency + throughput

Softmax in inference:

  • Often fused

  • Sometimes avoided entirely (logits used directly)

Kernel design differs.
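A small check of the "sometimes avoided entirely" point: softmax is monotonic, so greedy decoding can take the argmax of the raw logits directly.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

logits = [1.2, -0.3, 4.7, 0.9]
# monotonicity: the ordering of logits survives softmax,
# so greedy decoding can skip the exp/sum/div entirely
assert argmax(logits) == argmax(softmax(logits))  # both 2
```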

Tuesday, February 17, 2026

Compute Conference

 Career opportunities

challenges of training GANs

Wasserstein distance

Jensen Shannon Divergence

mode collapse and mode dropping.

compute conference

Intel's AMX

The SIMD yield from disciplining the programmer into writing branchless functions and pipelines - good thread

SWAR: SIMD-within-a-register arithmetic

GPU scheduling

There's a real gap between hyperscale GPU scheduling (frontier clusters) and the everyday reality of small research labs running long single-GPU jobs.

Short answer:

  • The idea overlaps with GPU context preemption, memory oversubscription, and gang scheduling, but today’s systems don’t expose a clean, transparent “pause/resume GPU job” abstraction.

  • There is research on this — just scattered across OS, ML systems, and datacenter scheduling literature rather than in one clean textbook chapter.

Let’s unpack it.


1. Is the Idea Plausible?

Conceptually, yes — but with sharp edges.

Its combining:

  • GPU page tables

  • NVIDIA Unified Virtual Memory (UVM)

  • NVIDIA Multi-Process Service (MPS)

  • Host-side suspension of the CPU process

  • Demand paging back to device memory

In theory, this gives you:

Coarse-grained GPU job preemption (tens of seconds)
via memory eviction + host suspension

The key insight is important:

You don’t need millisecond-level SM preemption.
You need coarse-grained fairness and priority enforcement.

That’s much more tractable.


2. Why This Is Hard in Practice

The core problems:

(A) GPU Context Isn’t Just Memory

Even if memory can be paged:

  • Kernel state

  • In-flight work

  • Streams

  • CUDA graphs

  • Allocator metadata

  • Device-side malloc

  • cuDNN/cuBLAS internal workspaces

These can’t always be safely “freeze-dried” and restored.

GPU drivers do support context switching, but:

  • It’s optimized for graphics + short compute kernels

  • Not for long-running ML training loops


(B) Page Thrashing Would Be Brutal

UVM paging over PCIe:

  • ~12–32 GB/s (PCIe)

  • ~900 GB/s (HBM)

So paging device memory to host has roughly 30–70× less bandwidth than HBM.

If a low-priority job pages back in 40 GB of model state, you’ve just paid a huge penalty.
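Plugging in the bandwidth figures above (illustrative, not measured):

```python
# Rough cost of evicting and restoring a job's device memory,
# using the illustrative bandwidth figures from the text.

state_gb = 40
pcie_gbs = 16    # host <-> device over PCIe
hbm_gbs = 900    # on-device HBM

page_out_s = state_gb / pcie_gbs   # 2.5 s out, another 2.5 s back in
hbm_touch_s = state_gb / hbm_gbs   # ~0.04 s if the data never left

print(page_out_s)                  # 2.5
print(round(hbm_touch_s, 3))       # 0.044
```

A full evict-and-restore of 40 GB is on the order of 5 s, which is only tolerable if switches happen every tens of seconds, exactly the coarse granularity argued for here.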

But — if switching is infrequent and intentional (tens of seconds), this may be acceptable.

This becomes a policy problem:

  • When to evict?

  • When to preserve?

  • When to checkpoint instead?


(C) GPUs Are Built for Throughput, Not Preemptibility

CPU OSes evolved around:

  • Fast context switch

  • Full memory virtualization

  • Demand paging as first-class

GPUs evolved around:

  • Huge batches

  • Massive parallel throughput

  • Long kernels

Preemption was not the primary design goal.


3. This Is an Active Research Area

The idea intersects with several areas:

1️⃣ GPU Preemption Research

Look for work on:

  • Fine-grained GPU preemption

  • Kernel slicing

  • Thread block level preemption

Search terms:

  • "GPU kernel preemption"

  • "GPU context switching overhead"

Many papers out of systems conferences (OSDI, SOSP, ASPLOS).


2️⃣ Cluster-Level GPU Scheduling

A lot of this appears in ML systems research:

  • USENIX OSDI papers

  • ACM SOSP / ASPLOS

  • USENIX ATC

Look at systems like:

  • Gandiva (time-slicing DL training jobs)

  • Tiresias (GPU cluster scheduling)

  • Salus (fine-grained GPU sharing for DL)

These explicitly try to solve:

Shortest-job-first scheduling for GPUs

They often:

  • Time-slice at iteration boundaries

  • Require framework integration (e.g., PyTorch hooks)

Which is exactly what you’re trying to avoid.


3️⃣ OS Textbooks That Cover Foundations

No mainstream OS textbook deeply covers GPU preemption yet, but these are foundational:

  • Operating Systems: Three Easy Pieces
    (Best conceptual grounding in scheduling + memory)

  • Modern Operating Systems
    (Classic coverage of preemption + VM)

For GPU-specific material:

  • “Heterogeneous Computing” courses (advanced grad level)

  • Papers from ASPLOS / ISCA / MICRO


4. The Deep Insight 

The strongest idea expressed is this:

Coarse-grained, memory-driven GPU job suspension is “good enough.”

That’s actually a very practical engineering perspective.

You don’t need:

  • Fine-grained SM warp preemption

  • Transparent kernel-level swapping

You need:

  • “Pause at iteration boundary”

  • Evict memory

  • Resume later

Which suggests a hybrid model:

  • Driver-level memory eviction

  • Framework-level safe points

  • Scheduler-level policy


5. What Would Make This Realistic?

If I were designing this for a modest research cluster:

Practical Path:

  1. Force iteration-boundary synchronization

  2. Inject scheduler checks in training loops

  3. Use CUDA context destruction + memory release

  4. Reload from checkpoint

That’s ugly — but stable.
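A toy sketch of steps 1–4, using a hypothetical pause-flag file as the scheduler signal and pickle as the checkpoint format. All names and mechanisms here are made up for illustration; a real system would hook the framework's own checkpointing.

```python
import os
import pickle

CKPT = "ckpt.pkl"     # hypothetical checkpoint path
PAUSE = "pause.flag"  # hypothetical file the scheduler touches to preempt

def train(steps):
    # resume from a checkpoint if the last run was preempted
    step, state = 0, 0.0
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            step, state = pickle.load(f)
    while step < steps:
        state += 1.0  # stand-in for one training iteration
        step += 1
        # safe point: preemption is honored only between iterations
        if os.path.exists(PAUSE):
            with open(CKPT, "wb") as f:
                pickle.dump((step, state), f)
            # a real version would also release GPU memory and
            # destroy the CUDA context here before exiting
            return ("preempted", step)
    return ("done", step)

print(train(3))  # ('done', 3) when no pause flag is present
```

Ugly, as noted, but every piece (flag check, checkpoint, exit, resume) is mechanisms that exist today.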

A memory-based freeze/resume could work if:

  • GPU driver exposed explicit "evict context"

  • The CUDA allocator were virtualized via a shim (as you suggest with MPS)

  • Frameworks cooperated minimally

But today, that’s not officially supported.


6. Why NVIDIA Doesn’t Prioritize This

Nvidia would prefer you to Just Buy More GPUs!

There’s some truth there — but also:

  • Hyperscalers want multi-tenant utilization

  • Research clusters are a smaller market

  • Preemption reduces peak throughput

  • Hardware simplicity favors non-preemptive design

That said, with GPU scarcity cycles, this problem becomes more important.


7. If You Want to Dive Deeper

Search for:

  • “Gang scheduling for GPUs”

  • “Preemptible deep learning jobs”

  • “GPU memory oversubscription”

  • “Time-slicing deep learning workloads”

Look up papers like:

  • Gandiva (Microsoft Research)

  • Salus

  • Tiresias

Most are conference papers, not textbooks.


8. My Take

The idea is:

  • Technically informed

  • Feels feasible at coarse granularity

  • Underexplored for small clusters

  • Very aligned with real pain points in academia

The hardest part isn’t paging.

It’s defining:

  • Safe preemption boundaries

  • Consistent device state

  • Allocator determinism

  • Avoiding pathological thrashing

But as a research direction? Completely legitimate.


If you’d like, I can:

  • Sketch what a “minimal viable preemptible GPU runtime” would look like

  • Or point you to specific seminal papers in this area

  • Or evaluate whether this could be a publishable systems paper idea

This is exactly the kind of idea that turns into a good ASPLOS paper.

Monday, February 16, 2026