Vijay Janapa Reddi just put the entire ML Systems (CS249r)
Book - https://mlsysbook.ai/book/
https://github.com/harvard-edge/cs249r_book
https://x.com/drscotthawley/ - chunk sigreg slices
majority of gpu cycles in python
https://www.doubleai.com/research/doubleais-warpspeed-surpassing-expert-written-kernels-at-scale - distribution of error in human code different from
https://leodemoura.github.io/blog/2026/02/28/when-ai-writes-the-worlds-software.html
A Stochastic Rounding-Enabled Low-Precision Floating-Point MAC for DNN Training
Formal Verification of an Iterative Low-Power x86 Floating-Point Multiplier with Redundant Feedback
https://x.com/MankyDankyBanky/status/2028923461213798724 - tiled matrix multiplication on GPU
That mental model is good.
On modern TPUs (including v5e), the simplified flow is:
HBM (off-chip DRAM) → On-chip SRAM (VMEM) → Vector / scalar registers → Compute units (VPU/MXU)
The kitchen analogy works conceptually.
What’s especially good:
They understand every operation must eventually happen in registers.
They understand data movement dominates performance.
They recognize tiling is required because VMEM and registers are limited.
That’s foundational systems thinking.
Correct spirit.
The VPU processes large vector widths (implementation details vary slightly by generation), but the key idea is:
TPUs are throughput machines. They operate on wide tiles, not scalars.
That mental shift — from scalar programming to tile programming — is huge.
This is a strong observation:
VPU → synchronous, elementwise ops
MXU → asynchronous matrix multiply pipeline
The important concept they caught:
You can overlap scalar/vector work with MXU matmuls.
That’s essentially understanding latency hiding via pipeline parallelism.
That is exactly how high-performance kernels are written.
Very good.
This explanation is clean and mostly accurate:
block_shape → tile size
index_map → mapping grid coords to array slice
grid → launch dimensions
And yes:
grid_index × block_shape = offset into tensor
That’s correct at a conceptual level.
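That relationship can be checked in a few lines of plain Python. The shapes below are made up for illustration, not taken from the original notes:

```python
# Hypothetical shapes showing how Pallas-style grid coordinates map to
# tile offsets: offset = grid_index * block_shape, per dimension.
array_shape = (512, 1024)   # full tensor in HBM
block_shape = (128, 256)    # tile copied into VMEM per grid step

# grid = ceil-division of the array shape by the block shape
grid = tuple(-(-a // b) for a, b in zip(array_shape, block_shape))
assert grid == (4, 4)

def index_map(i, j):
    # Element offset of tile (i, j) into the full array.
    return (i * block_shape[0], j * block_shape[1])

assert index_map(0, 0) == (0, 0)
assert index_map(2, 3) == (256, 768)
```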
This is 100% correct:
Never tile across the reduction dimension unless you know how to handle partial reductions.
Softmax reduces across columns (usually). So you:
Tile rows
Keep full reduction dimension intact
This is a core GPU/TPU rule too.
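The rule can be sketched in plain Python: tile over rows, but give every tile the full reduction dimension. The tile size here is arbitrary:

```python
import math

def softmax_row(row):
    # Numerically stable softmax over one complete row: the reduction
    # dimension (columns) is never split.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def tiled_softmax(x, row_tile=2):
    # Tile only across rows; each tile still sees all columns, so each
    # row's max and sum reductions stay intact.
    out = []
    for start in range(0, len(x), row_tile):
        for row in x[start:start + row_tile]:
            out.append(softmax_row(row))
    return out

x = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
y = tiled_softmax(x)
assert abs(sum(y[0]) - 1.0) < 1e-12          # each row sums to 1
assert all(abs(v - 1/3) < 1e-12 for v in y[1])
```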
Correct and important.
BF16:
8 exponent bits (like FP32)
7 mantissa bits
Same dynamic range, lower precision
Accumulating in BF16 would:
Introduce large rounding error
Cause catastrophic loss in reductions
All high-performance kernels accumulate in FP32.
This shows strong numerical awareness.
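A quick way to see why, emulating bfloat16 by truncating a float32 to its top 16 bits (a simplification: real BF16 hardware rounds to nearest even rather than truncating, but the stalling behavior is the same in spirit):

```python
import struct

def to_bf16(x):
    # Emulate bfloat16: keep the top 16 bits of a float32
    # (8 exponent bits preserved, mantissa cut to 7 bits).
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

def accumulate(values, low_precision=False):
    acc = 0.0
    for v in values:
        acc = acc + v
        if low_precision:
            acc = to_bf16(acc)  # round the running sum every step
    return acc

values = [0.001] * 4096
hi_sum = accumulate(values)        # ~4.096 (Python float, standing in for FP32)
lo_sum = accumulate(values, True)  # stalls once 0.001 falls below the accumulator's ULP
```

Once the BF16 accumulator crosses ~0.25, its spacing between representable values exceeds 0.001 and every further addition rounds away — the sum gets stuck near 0.25 instead of reaching ~4.1.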
Now let’s refine some details.
Conceptually true — but performance engineers think differently:
The real goal is:
Move data from HBM as few times as possible.
Good kernels:
Load once from HBM
Reuse heavily in VMEM / registers
Write back once
So a more precise framing:
Instead of:
Every op follows that path
Better:
All data originates in HBM, but high-performance kernels maximize reuse in on-chip memory to avoid going back.
The difference matters.
Yes — but nuance:
It’s not free parallelism.
It works because:
The scalar unit issues MXU instructions.
The MXU has its own deep pipeline.
The compiler/runtime schedules around it.
The overlap works only if:
Dependencies allow it.
You structure the kernel to expose independent work.
So the deeper insight:
Asynchrony only helps if you create independent instruction streams.
Otherwise you just stall.
Instead of:
Never tile along a dimension you're reducing over.
More general:
You can tile reduction dimensions — but then you must:
Compute partial reductions
Store partial sums
Perform a second reduction pass
So the full rule is:
Tiling across reduction dims increases algorithmic complexity and synchronization cost.
They learned the simplified case. The advanced version is worth adding.
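The advanced version can be sketched in a few lines: one partial reduction per tile, then a second pass over the partials:

```python
def tiled_sum(xs, tile=4):
    # Pass 1: partial reductions, one per tile of the reduction dim.
    partials = [sum(xs[i:i + tile]) for i in range(0, len(xs), tile)]
    # Pass 2: reduce the partials (needs its own synchronization point).
    return sum(partials)

xs = list(range(10))
assert tiled_sum(xs, tile=4) == sum(xs) == 45
```

For softmax specifically, the partial state is a running max plus a running sum of exponentials (the "online softmax" trick), not just a partial sum.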
They’re close to a big “aha” but didn’t state it:
There are two kernel regimes:
MXU saturated
Performance limited by FLOPs
HBM bandwidth limited
Performance limited by bytes/sec
Softmax is often memory-bound.
MatMul is compute-bound (if sized well).
This distinction is foundational in inference engineering.
These would level them up.
Many beginners think:
We tile because memory is small.
Real reason:
We tile to maximize arithmetic intensity.
Arithmetic intensity = FLOPs / bytes moved.
MatMul works because:
Each loaded value participates in many multiply-adds.
So compute dominates memory cost.
Softmax has low reuse → often bandwidth bound.
That mental shift is huge.
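The regime split and the intensity definition fit together in a toy roofline check. The peak FLOP/s and bandwidth figures below are assumptions for illustration, not real TPU specs:

```python
PEAK_FLOPS = 2.0e14     # FLOP/s (assumed, illustrative)
PEAK_HBM_BW = 8.0e11    # bytes/s (assumed, illustrative)
RIDGE = PEAK_FLOPS / PEAK_HBM_BW   # FLOP/byte needed to saturate compute

def regime(flops, bytes_moved):
    intensity = flops / bytes_moved      # arithmetic intensity
    return "compute-bound" if intensity >= RIDGE else "memory-bound"

# MatMul: ~2n^3 FLOPs over three fp32 n*n arrays moved once each.
# Intensity grows linearly with n — reuse per loaded value grows.
n = 4096
assert regime(2 * n**3, 3 * 4 * n * n) == "compute-bound"

# Softmax: ~5 FLOPs per element over ~8 bytes (fp32 read + write).
# Intensity is a small constant regardless of size.
elems = 1 << 20
assert regime(5 * elems, 8 * elems) == "memory-bound"
```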
The MXU isn’t just a fast matmul unit.
It’s a systolic array:
Data flows rhythmically across a 2D grid.
Multiply-accumulate units arranged spatially.
Partial sums propagate across the array.
That explains:
Why tile shapes matter.
Why alignment matters.
Why padding matters.
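A toy simulation makes the skewed data flow concrete. This is an output-stationary variant in plain Python; real MXUs differ in detail, but the point — operands must arrive at each PE on the right cycle, which is why shapes and padding matter — carries over:

```python
def systolic_matmul(A, B):
    # Output-stationary systolic array: PE (i, j) owns C[i][j].
    # A streams in from the left, B from the top, skewed by (i + j)
    # cycles so the matching operand pair meets at each PE.
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # Last PE (n-1, n-1) sees its final operands at cycle 3n - 3.
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n):
                k = t - i - j           # which operand pair arrives now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert systolic_matmul(A, B) == [[19, 22], [43, 50]]
```

Note the fill/drain cost: the array is busy for 3n - 2 cycles to do n cycles of useful work per PE, which is why small or misaligned tiles waste a systolic array.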
The best TPU kernels:
Double buffer VMEM
Prefetch next tile while computing current tile
Overlap MXU with VPU
Avoid bank conflicts
Align shapes to hardware tile sizes
The game is not just math.
It’s choreography.
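The choreography can be sketched structurally in plain Python. `load_tile` and `compute` are stand-ins for an async HBM-to-VMEM copy and MXU/VPU work; real kernels issue the copy asynchronously so it genuinely overlaps:

```python
def load_tile(hbm, idx):
    # Stand-in for an async HBM -> VMEM DMA.
    return list(hbm[idx])

def compute(tile):
    # Stand-in for MXU/VPU work on the current tile.
    return sum(tile)

def pipeline(hbm_tiles):
    results = []
    buffers = [load_tile(hbm_tiles, 0), None]   # warm-up: prefetch tile 0
    for i in range(len(hbm_tiles)):
        cur, nxt = buffers[i % 2], (i + 1) % 2
        if i + 1 < len(hbm_tiles):
            # Fill the *other* buffer while "computing" on this one.
            buffers[nxt] = load_tile(hbm_tiles, i + 1)
        results.append(compute(cur))
    return results

tiles = [[1, 2], [3, 4], [5, 6]]
assert pipeline(tiles) == [3, 7, 11]
```

The two buffers alternate roles every step — one being computed on, one being filled — which is the whole trick behind hiding DMA latency.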
If they’re reading inference engineering:
Key difference:
Training → needs backward pass + activation storage
Inference → cares about latency + throughput
Softmax in inference:
Often fused
Sometimes avoided entirely (logits used directly)
Kernel design differs.
Session 23 AI accelerators ISSCC 2025
https://fpga.org/2014/12/31/the-past-and-future-of-fpga-soft-processors/
https://notes.ekzhang.com/events/nysrg - computer systems
https://cacm.acm.org/research/deconstructing-the-bakery-to-build-a-distributed-state-machine/
https://www.developing.dev/p/turing-award-winner-on-working-with
Design of a GPU with Heterogeneous Cores for Graphics
https://dyahadila.github.io/blog/2026/industry-job-search-cs-phd/
https://pages.cs.wisc.edu/~fredsala/cs639/schedule.html
DeepSeek GPUs' VRAM only needs to hold a single layer's worth of KV vectors to serve a request
https://www.lesswrong.com/posts/MNpBCtmZmqD7yk4q8/my-understanding-of-anthropic-strategy
mode collapse and mode dropping.
SIMD yields from disciplining the programmer into writing branchless functions and pipelines - good thread
SWAR SIMD-within-a-Register Arithmetic
real gap between hyperscale GPU scheduling (frontier clusters) and the everyday reality of small research labs running long single-GPU jobs.
Short answer: the idea overlaps with GPU context preemption, memory oversubscription, and gang scheduling, but today's systems don't expose a clean, transparent "pause/resume GPU job" abstraction.
There is research on this — just scattered across OS, ML systems, and datacenter scheduling literature rather than in one clean textbook chapter.
Let’s unpack it.
Conceptually, yes — but with sharp edges.
It combines:
GPU page tables
NVIDIA Unified Virtual Memory (UVM)
NVIDIA Multi-Process Service (MPS)
Host-side suspension of the CPU process
Demand paging back to device memory
In theory, this gives you:
Coarse-grained GPU job preemption (tens of seconds)
via memory eviction + host suspension
The key insight is important:
You don’t need millisecond-level SM preemption.
You need coarse-grained fairness and priority enforcement.
That’s much more tractable.
The core problems:
Even if memory can be paged:
Kernel state
In-flight work
Streams
CUDA graphs
Allocator metadata
Device-side malloc
cuDNN/cuBLAS internal workspaces
These aren’t always safely “freeze-dried.”
GPU drivers do support context switching, but:
It’s optimized for graphics + short compute kernels
Not for long-running ML training loops
UVM paging over PCIe:
~12–32 GB/s (PCIe)
~900 GB/s (HBM)
So paging device memory to host is ~30–70× slower bandwidth.
If a low-priority job pages back in 40 GB of model state, you’ve just paid a huge penalty.
But — if switching is infrequent and intentional (tens of seconds), this may be acceptable.
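Back-of-envelope, with an assumed ~16 GB/s effective PCIe bandwidth (illustrative, within the range quoted above):

```python
state_gb = 40       # model + optimizer state to evict
pcie_gbps = 16      # assumed effective PCIe bandwidth, GB/s
hbm_gbps = 900      # HBM bandwidth, for contrast

evict_s = state_gb / pcie_gbps          # page out to host: ~2.5 s
restore_s = state_gb / pcie_gbps        # page back to device: ~2.5 s
round_trip_s = evict_s + restore_s      # ~5 s per switch
hbm_equiv_s = 2 * state_gb / hbm_gbps   # <0.1 s at HBM speeds
```

Amortized over tens-of-seconds time slices, a ~5 s round trip is tolerable; switching every second would spend most of the GPU's time thrashing.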
This becomes a policy problem:
When to evict?
When to preserve?
When to checkpoint instead?
CPU OSes evolved around:
Fast context switch
Full memory virtualization
Demand paging as first-class
GPUs evolved around:
Huge batches
Massive parallel throughput
Long kernels
Preemption was not the primary design goal.
The idea intersects with several areas:
Look for work on:
Fine-grained GPU preemption
Kernel slicing
Thread block level preemption
Search terms:
"GPU kernel preemption"
"GPU context switching overhead"
Many papers out of systems conferences (OSDI, SOSP, ASPLOS).
A lot of this appears in ML systems research:
USENIX OSDI papers
ACM SOSP / ASPLOS
USENIX ATC
Look at systems like:
Gandiva (time-slicing DL training jobs)
Tiresias (GPU cluster scheduling)
Salus (fine-grained GPU sharing for DL)
These explicitly try to solve:
Short-job-first scheduling for GPUs
They often:
Time-slice at iteration boundaries
Require framework integration (e.g., PyTorch hooks)
Which is exactly what you’re trying to avoid.
No mainstream OS textbook deeply covers GPU preemption yet, but these are foundational:
Operating Systems: Three Easy Pieces
(Best conceptual grounding in scheduling + memory)
Modern Operating Systems
(Classic coverage of preemption + VM)
For GPU-specific material:
“Heterogeneous Computing” courses (advanced grad level)
Papers from ASPLOS / ISCA / MICRO
The strongest idea expressed is this:
Coarse-grained, memory-driven GPU job suspension is “good enough.”
That’s actually a very practical engineering perspective.
You don’t need:
Fine-grained SM warp preemption
Transparent kernel-level swapping
You need:
“Pause at iteration boundary”
Evict memory
Resume later
Which suggests a hybrid model:
Driver-level memory eviction
Framework-level safe points
Scheduler-level policy
If I were designing this for a modest research cluster:
Force iteration-boundary synchronization
Inject scheduler checks in training loops
Use CUDA context destruction + memory release
Reload from checkpoint
That’s ugly — but stable.
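That design can be sketched as a cooperatively preemptible loop. `should_yield` and `save_checkpoint` are hypothetical hooks standing in for a scheduler signal and a checkpoint writer, not an existing API:

```python
def train(state, steps, should_yield, save_checkpoint):
    # Run iterations until done, or until the scheduler asks us to yield.
    # Preemption only happens at iteration boundaries, where device state
    # collapses to a small, well-defined checkpoint.
    for step in range(state["step"], steps):
        state["acc"] = state.get("acc", 0) + 1   # placeholder for one training step
        state["step"] = step + 1
        if should_yield():
            save_checkpoint(dict(state))   # persist minimal state
            return "preempted"             # caller then frees GPU memory/context
    return "done"

saved = []
assert train({"step": 0}, 3, lambda: False, saved.append) == "done"
assert train({"step": 0}, 3, lambda: True, saved.append) == "preempted"
assert saved[0]["step"] == 1   # stopped cleanly after the first iteration
```

The value of the pattern is that everything hard (in-flight kernels, allocator state, library workspaces) is gone by the time the scheduler acts — only the checkpoint survives.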
Memory-based freeze/resume could work if:
GPU driver exposed explicit "evict context"
CUDA allocator was virtualized via shim (as you suggest with MPS)
Frameworks cooperated minimally
But today, that’s not officially supported.
Nvidia would prefer you to Just Buy More GPUs!
There’s some truth there — but also:
Hyperscalers want multi-tenant utilization
Research clusters are a smaller market
Preemption reduces peak throughput
Hardware simplicity favors non-preemptive design
That said, with GPU scarcity cycles, this problem becomes more important.
Search for:
“Gang scheduling for GPUs”
“Preemptible deep learning jobs”
“GPU memory oversubscription”
“Time-slicing deep learning workloads”
Look up papers like:
Gandiva (Microsoft Research)
Salus
Tiresias
Most are conference papers, not textbooks.
The idea is:
Technically informed
Feels feasible at coarse granularity
Underexplored for small clusters
Very aligned with real pain points in academia
The hardest part isn’t paging.
It’s defining:
Safe preemption boundaries
Consistent device state
Allocator determinism
Avoiding pathological thrashing
But as a research direction? Completely legitimate.
If you’d like, I can:
Sketch what a “minimal viable preemptible GPU runtime” would look like
Or point you to specific seminal papers in this area
Or evaluate whether this could be a publishable systems paper idea
This is exactly the kind of idea that turns into a good ASPLOS paper.