Chapter 7
Precision, Sparsity, and Energy: Why Speedups Are Rarely Free
The Core Question
Why don’t reduced precision and sparsity always translate into speedups?
7.1 Precision Is a Bandwidth Optimization First
Lower precision:
- Reduces memory footprint
- Increases effective bandwidth
- Improves cache reuse
Only secondarily:
- Increases compute throughput
If a kernel is:
- Memory-bound → lower precision helps directly, because fewer bytes move per operand
- Compute-bound → lower precision helps only if the hardware also raises its low-precision compute roof (a rough check is sketched below)
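A quick way to see the distinction is a roofline estimate. The sketch below is a minimal Python model: PEAK_FLOPS and PEAK_BW are illustrative placeholders rather than the specs of any real chip, and the byte counts assume ideal reuse. Halving the bytes per element doubles the ceiling of the memory-bound kernel but leaves the compute-bound one untouched unless the hardware also has a higher low-precision compute roof.

```python
# Rough roofline check: when does dropping fp32 -> fp16 actually raise the
# performance ceiling? Hardware numbers are illustrative placeholders only.

PEAK_FLOPS = 100e12   # assumed sustained FLOP/s (same for both precisions here)
PEAK_BW    = 1e12     # assumed sustained memory bandwidth, bytes/s

def perf_cap(flops, bytes_moved):
    """Roofline: min of the compute roof and bandwidth * arithmetic intensity."""
    return min(PEAK_FLOPS, (flops / bytes_moved) * PEAK_BW)

n = 4096
kernels = {
    # elementwise add: 1 FLOP per element, reads 2 values, writes 1
    "elementwise add": lambda b: (n * n, 3 * n * n * b),
    # square matmul: 2*n^3 FLOPs, touches 3 n*n matrices (ideal cache reuse)
    "matmul":          lambda b: (2 * n**3, 3 * n * n * b),
}

for name, model in kernels.items():
    for prec, bytes_per_elem in [("fp32", 4), ("fp16", 2)]:
        flops, bytes_moved = model(bytes_per_elem)
        print(f"{name:15s} {prec}: cap = {perf_cap(flops, bytes_moved)/1e12:7.2f} TFLOP/s")
```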
7.2 Tensor Cores Revisited (Without Hype)
Tensor cores:
- Execute dense, aligned matrix ops
- Require specific shapes
- Demand compiler cooperation
If software fails to:
- Tile correctly
- Align data
- Fuse ops
Then tensor cores sit idle.
Specialized hardware amplifies good structure — it does not create it.
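A concrete example of "require specific shapes": fp16 tensor-core GEMM paths are commonly documented as wanting M, N, and K to be multiples of 8, though the exact multiple depends on the hardware generation and dtype. The helpers below (pad_up, tensor_core_friendly) are hypothetical, not a real library API; they just show the padding decision a framework or compiler has to make, and what it costs.

```python
# Minimal sketch: check whether a GEMM shape meets a typical tensor-core
# alignment rule and pad it up if not. The multiple-of-8 rule mirrors
# commonly published fp16 guidance; the exact value is hardware-dependent.

def pad_up(x, multiple=8):
    """Round x up to the next multiple (the padding is wasted work)."""
    return ((x + multiple - 1) // multiple) * multiple

def tensor_core_friendly(m, n, k, multiple=8):
    return all(d % multiple == 0 for d in (m, n, k))

m, n, k = 999, 333, 4096          # an awkward shape, e.g. from a ragged batch
if not tensor_core_friendly(m, n, k):
    pm, pn, pk = (pad_up(d) for d in (m, n, k))
    waste = 1 - (m * n * k) / (pm * pn * pk)
    print(f"pad ({m}, {n}, {k}) -> ({pm}, {pn}, {pk}); "
          f"~{waste:.1%} of the padded GEMM is wasted work")
else:
    print("shape already tensor-core friendly")
```

When the padding overhead is too large, or the data cannot be laid out contiguously, the practical fallback is a kernel that bypasses the tensor cores, which is exactly the "sit idle" case above.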
7.3 Sparsity: Why It’s Hard
Unstructured sparsity:
- Random zeros
- Irregular access
- Poor hardware mapping
Structured sparsity:
- Blocked patterns
- Predictable skips
- Hardware-friendly
Most sparsity papers ignore:
- Index overhead
- Control cost
- Load imbalance
This is why the promised speedups rarely appear in practice (the sketch below makes the overhead concrete).
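Here is a small NumPy sketch with illustrative sizes; the CSR format is built by hand rather than with scipy so the metadata stays visible. At 10% density the int32 column indices cost as much memory as the retained fp32 values, every nonzero turns into a data-dependent gather, and the per-row nonzero counts vary, which is the load-imbalance problem.

```python
# Minimal sketch of why unstructured sparsity is not free: a CSR
# sparse-matrix / dense-vector multiply carries index metadata and an
# irregular gather per nonzero. Sizes and density are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, density = 4096, 0.10                       # 90% zeros, randomly placed
dense = rng.standard_normal((n, n)) * (rng.random((n, n)) < density)

# Build CSR by hand: values, column indices, and row pointers.
rows, cols = np.nonzero(dense)
values  = dense[rows, cols].astype(np.float32)
col_idx = cols.astype(np.int32)
row_ptr = np.zeros(n + 1, dtype=np.int32)
np.add.at(row_ptr, rows + 1, 1)
row_ptr = np.cumsum(row_ptr).astype(np.int32)

def csr_matvec(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1, dtype=np.float32)
    for i in range(len(y)):                   # per-row work varies -> load imbalance
        s, e = row_ptr[i], row_ptr[i + 1]
        y[i] = values[s:e] @ x[col_idx[s:e]]  # gather: irregular memory access
    return y

x = rng.standard_normal(n).astype(np.float32)
dense_bytes  = dense.astype(np.float32).nbytes
sparse_bytes = values.nbytes + col_idx.nbytes + row_ptr.nbytes
print(f"dense: {dense_bytes/1e6:.1f} MB, CSR: {sparse_bytes/1e6:.1f} MB "
      f"(indices alone: {col_idx.nbytes/1e6:.1f} MB)")
```

Structured formats give up freedom over where the zeros may fall precisely to remove this bookkeeping and keep the skips predictable.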
7.4 Energy Is the Real Objective Function
Ultimately:
- Power limits performance
- Data movement sets the energy floor (see the tally below)
- FLOPs/W matters more than FLOPs
This is why:
- AI hardware trends toward specialization
- On-chip memory grows
- Generality shrinks
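To put rough numbers on the energy floor, the sketch below tallies one fused multiply-add whose operands come from on-chip SRAM versus off-chip DRAM. The per-operation energies are the widely cited ~45 nm estimates from Horowitz's ISSCC 2014 talk; treat them as order-of-magnitude ratios, not figures for any current process.

```python
# Order-of-magnitude energy accounting for a fused multiply-add whose
# operands come from different levels of the memory hierarchy. Per-op
# energies are widely cited ~45 nm estimates (Horowitz, ISSCC 2014);
# they are rough ratios, not numbers for any current chip.

ENERGY_PJ = {
    "fp32_mul":   3.7,     # one 32-bit floating-point multiply
    "fp32_add":   0.9,     # one 32-bit floating-point add
    "sram_32b":   5.0,     # 32-bit read from a small on-chip SRAM
    "dram_32b": 640.0,     # 32-bit read from off-chip DRAM
}

def fma_energy(operand_source):
    """Energy of one multiply-add when both operands come from operand_source."""
    compute = ENERGY_PJ["fp32_mul"] + ENERGY_PJ["fp32_add"]
    movement = 2 * ENERGY_PJ[operand_source]
    return compute, movement

for src in ("sram_32b", "dram_32b"):
    compute, movement = fma_energy(src)
    print(f"{src}: compute {compute:5.1f} pJ, data movement {movement:6.1f} pJ "
          f"({movement / compute:5.1f}x the arithmetic)")
```

At DRAM distance the arithmetic is a rounding error, which is why growing on-chip memory buys more than adding faster ALUs.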
Chapter 7 Takeaway
An optimization only matters if it reduces data movement or communication — otherwise it is cosmetic.
Deep References 🔍
🟢 Conceptual
- Sze — energy breakdowns
- Hennessy & Patterson — power walls
🟡 Architecture
- NVIDIA mixed-precision docs
- TPU quantization papers
🔴 Hardware
- Sparse accelerator designs
- Energy modeling (ISSCC)