Adhyayan
Friday, February 20, 2026
Tuesday, February 17, 2026
Compute Conference
mode collapse and mode dropping.
SIMD's yield comes from disciplining the programmer into writing branchless functions and pipelines - good thread
SWAR (SIMD Within A Register) arithmetic
You're pointing at a real gap between hyperscale GPU scheduling (frontier clusters) and the everyday reality of small research labs running long single-GPU jobs.
Short answer:
- The idea overlaps with GPU context preemption, memory oversubscription, and gang scheduling, but today’s systems don’t expose a clean, transparent “pause/resume GPU job” abstraction.
- There is research on this; it’s just scattered across OS, ML systems, and datacenter scheduling literature rather than collected in one clean textbook chapter.
Let’s unpack it.
1. Is the Idea Plausible?
Conceptually, yes — but with sharp edges.
You’re combining:
- GPU page tables
- NVIDIA Unified Virtual Memory (UVM)
- NVIDIA Multi-Process Service (MPS)
- Host-side suspension of the CPU process
- Demand paging back to device memory
In theory, this gives you:
Coarse-grained GPU job preemption (tens of seconds) via memory eviction + host suspension.
The key insight is this:
You don’t need millisecond-level SM preemption.
You need coarse-grained fairness and priority enforcement.
That’s much more tractable.
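To make the host-suspension half concrete, here is a minimal sketch, assuming a single-process training job whose PID is known (the PID below is hypothetical), of coarse pause/resume with POSIX job-control signals. Note that this alone leaves the job's allocations pinned in device memory, which is exactly why the eviction half matters:

```python
import os
import signal
import time

def pause_job(pid: int) -> None:
    # Freeze the host process. Kernels already launched run to completion,
    # but no new CUDA work gets submitted while the process is stopped.
    os.kill(pid, signal.SIGSTOP)

def resume_job(pid: int) -> None:
    # Let the host process continue submitting work.
    os.kill(pid, signal.SIGCONT)

if __name__ == "__main__":
    low_priority_pid = 12345          # hypothetical PID of a long-running training job
    pause_job(low_priority_pid)       # its GPU memory is still resident at this point
    time.sleep(30)                    # window for a short, high-priority job
    resume_job(low_priority_pid)
```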
2. Why This Is Hard in Practice
The core problems:
(A) GPU Context Isn’t Just Memory
Even if memory can be paged:
- Kernel state
- In-flight work
- Streams
- CUDA graphs
- Allocator metadata
- Device-side malloc
- cuDNN/cuBLAS internal workspaces
These can’t always be safely “freeze-dried.”
GPU drivers do support context switching, but:
- It’s optimized for graphics + short compute kernels
- Not for long-running ML training loops
(B) Page Thrashing Would Be Brutal
UVM paging over PCIe:
- PCIe: ~12–32 GB/s
- HBM: ~900 GB/s
So paging device memory out to the host runs at roughly 30–70× less bandwidth than keeping it in HBM.
If a low-priority job pages back in 40 GB of model state, you’ve just paid a huge penalty.
But — if switching is infrequent and intentional (tens of seconds), this may be acceptable.
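A back-of-envelope estimate makes the trade-off concrete (all numbers below are assumptions in line with the figures above, not measurements):

```python
# Rough cost of paging a job's working set back onto the GPU.
working_set_gb = 40        # model weights + optimizer state + activations
pcie_gbps = 25             # effective PCIe Gen4 x16 throughput (peak ~32 GB/s)
hbm_gbps = 900             # on-device HBM bandwidth, for comparison

page_in_s = working_set_gb / pcie_gbps   # ~1.6 s as a bulk copy
hbm_s = working_set_gb / hbm_gbps        # ~0.04 s if the data had never left the device

print(f"page-in over PCIe : {page_in_s:.1f} s")
print(f"same data over HBM: {hbm_s:.2f} s")
# UVM demand paging (faulting in 4 KB-2 MB chunks) usually lands well below the
# bulk-copy rate, so the real number is worse, but amortized over a switch every
# few minutes it stays in "acceptable" territory.
```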
This becomes a policy problem:
- When to evict?
- When to preserve?
- When to checkpoint instead?
(C) GPUs Are Built for Throughput, Not Preemptibility
CPU OSes evolved around:
- Fast context switches
- Full memory virtualization
- Demand paging as a first-class mechanism
GPUs evolved around:
- Huge batches
- Massive parallel throughput
- Long kernels
Preemption was not the primary design goal.
3. This Is an Active Research Area
The idea intersects with several areas:
1️⃣ GPU Preemption Research
Look for work on:
- Fine-grained GPU preemption
- Kernel slicing
- Thread-block-level preemption
Search terms:
- "GPU kernel preemption"
- "GPU context switching overhead"
Many papers out of systems conferences (OSDI, SOSP, ASPLOS).
2️⃣ Cluster-Level GPU Scheduling
A lot of this appears in ML systems research:
- USENIX OSDI papers
- ACM SOSP / ASPLOS
- USENIX ATC
Look at systems like:
- Gandiva (time-slicing DL training jobs)
- Tiresias (GPU cluster scheduling)
- Salus (fine-grained GPU sharing for DL)
These explicitly try to solve:
Short-job-first scheduling for GPUs
They often:
- Time-slice at iteration boundaries
- Require framework integration (e.g., PyTorch hooks)
Which is exactly what you’re trying to avoid.
3️⃣ OS Textbooks That Cover Foundations
No mainstream OS textbook deeply covers GPU preemption yet, but these are foundational:
- Operating Systems: Three Easy Pieces (best conceptual grounding in scheduling + memory)
- Modern Operating Systems (classic coverage of preemption + VM)
For GPU-specific material:
- “Heterogeneous Computing” courses (advanced grad level)
- Papers from ASPLOS / ISCA / MICRO
4. The Deep Insight
The strongest idea you’ve expressed is this:
Coarse-grained, memory-driven GPU job suspension is “good enough.”
That’s actually a very practical engineering perspective.
You don’t need:
- Fine-grained SM warp preemption
- Transparent kernel-level swapping
You need:
- “Pause at iteration boundary”
- Evict memory
- Resume later
Which suggests a hybrid model:
- Driver-level memory eviction
- Framework-level safe points (sketched below)
- Scheduler-level policy
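Here is a minimal sketch of the framework-level piece, assuming a hypothetical flag file (/var/run/gpu_sched/preempt) that a node-level scheduler creates to request eviction and removes to allow resumption; the path, the polling interval, and the division of labor are all assumptions:

```python
import os
import time
import torch

PREEMPT_FLAG = "/var/run/gpu_sched/preempt"   # hypothetical file written by the scheduler

def preemption_requested() -> bool:
    return os.path.exists(PREEMPT_FLAG)

def safe_point(model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> None:
    """Call once per training iteration; parks device state in host RAM while preempted."""
    if not preemption_requested():
        return
    torch.cuda.synchronize()                  # quiesce: no in-flight kernels past this line
    model.cpu()                               # parameters and gradients move to host memory
    for state in optimizer.state.values():    # Adam moments etc. live in optimizer.state
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.cpu()
    torch.cuda.empty_cache()                  # hand cached blocks back to the driver
    while preemption_requested():             # wait until the scheduler clears the flag
        time.sleep(1)
    model.cuda()                              # page everything back and carry on
    for state in optimizer.state.values():
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.cuda()
```

The driver-level and scheduler-level pieces sit outside this function; the point is only that a single hook at the iteration boundary gives the whole scheme enough surface area.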
5. What Would Make This Realistic?
If I were designing this for a modest research cluster:
Practical Path:
- Force iteration-boundary synchronization
- Inject scheduler checks in training loops
- Use CUDA context destruction + memory release
- Reload from checkpoint
That’s ugly — but stable.
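For reference, a rough sketch of that ugly-but-stable path, assuming the node scheduler delivers SIGTERM when it wants the GPU back and relaunches the job later; the checkpoint path, the signal choice, and the training loop are all placeholders:

```python
import signal
import torch

CKPT_PATH = "/scratch/job_1234/latest.pt"    # hypothetical checkpoint location
_stop_requested = False

def _on_stop(signum, frame):
    global _stop_requested
    _stop_requested = True

signal.signal(signal.SIGTERM, _on_stop)      # scheduler asks the job to yield the GPU

def train(model, optimizer, data_loader, start_step=0):
    step = start_step
    for batch in data_loader:
        loss = model(batch).mean()           # stand-in for the real loss computation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if _stop_requested:                  # iteration boundary = safe preemption point
            torch.save({"step": step,
                        "model": model.state_dict(),
                        "optim": optimizer.state_dict()}, CKPT_PATH)
            return                           # returning lets the process exit; destroying
                                             # the CUDA context frees all device memory

# On relaunch, the wrapper script loads CKPT_PATH (if it exists) and resumes from "step".
```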
A memory-based freeze/resume could work if:
- The GPU driver exposed an explicit "evict context" operation
- The CUDA allocator were virtualized via a shim (as you suggest, with MPS)
- Frameworks cooperated minimally
But today, that’s not officially supported.
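To pin down what "officially supported" would have to mean, here is a purely hypothetical interface sketch; neither GpuJobControl nor any of these methods exists in the CUDA driver, runtime, or MPS today, they just name the operations the scheme needs:

```python
from typing import Protocol

class GpuJobControl(Protocol):
    """Hypothetical per-job handle that a driver or MPS-style shim could expose."""

    def evict(self) -> None:
        """Quiesce the job's context and migrate its device allocations to host memory."""

    def restore(self) -> None:
        """Page the allocations back onto the device and unblock pending launches."""

    def resident_bytes(self) -> int:
        """Device memory the job currently pins (input to the eviction policy)."""
```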
6. Why NVIDIA Doesn’t Prioritize This
NVIDIA would prefer that you just buy more GPUs!
There’s some truth there — but also:
- Hyperscalers want multi-tenant utilization
- Research clusters are a smaller market
- Preemption reduces peak throughput
- Hardware simplicity favors non-preemptive design
That said, with GPU scarcity cycles, this problem becomes more important.
7. If You Want to Dive Deeper
Search for:
- “Gang scheduling for GPUs”
- “Preemptible deep learning jobs”
- “GPU memory oversubscription”
- “Time-slicing deep learning workloads”
Look up papers like:
- Gandiva (Microsoft Research)
- Salus
- Tiresias
Most are conference papers, not textbooks.
8. My Take
Your idea is:
- Technically informed
- Feasible at coarse granularity
- Underexplored for small clusters
- Very aligned with real pain points in academia
The hardest part isn’t paging.
It’s defining:
- Safe preemption boundaries
- Consistent device state
- Allocator determinism
- Avoiding pathological thrashing
But as a research direction? Completely legitimate.
If you’d like, I can:
- Sketch what a “minimal viable preemptible GPU runtime” would look like
- Or point you to specific seminal papers in this area
- Or evaluate whether this could be a publishable systems paper idea
This is exactly the kind of idea that turns into a good ASPLOS paper.