
Chapter 7

Precision, Sparsity, and Energy: Why Speedups Are Rarely Free

The Core Question

Why don’t reduced precision and sparsity always translate into speedups?


7.1 Precision Is a Bandwidth Optimization First

Lower precision:

  • Reduces memory footprint

  • Increases effective bandwidth

  • Improves cache reuse

Only secondarily:

  • Increases compute throughput

If a kernel is:

  • Memory-bound → lower precision helps directly, since every byte moved carries more values (see the sketch below)

  • Compute-bound → lower precision helps only if the hardware has native low-precision execution units
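
A back-of-envelope roofline sketch makes the two regimes concrete. The peak numbers below are illustrative assumptions (roughly A100-class figures); the point is the max(compute time, traffic time) bound, not the exact values.

    # Roofline-style estimate: is a kernel memory- or compute-bound, and what
    # does halving the element size actually buy? Peaks are assumed values.

    PEAK_BW   = 1.5e12      # bytes/s of DRAM bandwidth        (assumed)
    PEAK_FP32 = 19.5e12     # FLOP/s at FP32                   (assumed)
    PEAK_FP16 = 312e12      # FLOP/s at FP16 on tensor cores   (assumed)

    def kernel_time(flops, bytes_moved, peak_flops, peak_bw=PEAK_BW):
        """Lower bound on runtime: whichever resource saturates first wins."""
        return max(flops / peak_flops, bytes_moved / peak_bw)

    # Case 1: a streaming elementwise op (1 FLOP per element, read plus write).
    N_ELEMS = 1 << 28
    t32 = kernel_time(N_ELEMS, N_ELEMS * 8, PEAK_FP32)   # 4-byte in, 4-byte out
    t16 = kernel_time(N_ELEMS, N_ELEMS * 4, PEAK_FP16)   # half the traffic
    print(f"elementwise: {t32*1e3:.2f} ms -> {t16*1e3:.2f} ms "
          f"({t32/t16:.1f}x, all of it from bandwidth)")

    # Case 2: a large matmul (M = N = K = 8192), which is compute-bound, so the
    # gain comes from higher low-precision throughput, if the hardware has it.
    M = N = K = 8192
    flops = 2 * M * N * K
    t32 = kernel_time(flops, 4 * (M*K + K*N + M*N), PEAK_FP32)
    t16 = kernel_time(flops, 2 * (M*K + K*N + M*N), PEAK_FP16)
    print(f"matmul     : {t32*1e3:.2f} ms -> {t16*1e3:.2f} ms "
          f"({t32/t16:.1f}x, all of it from tensor-core throughput)")

In the streaming case the 2x comes entirely from moving half as many bytes; in the matmul case it comes entirely from the FP16 execution units, and it evaporates on hardware that lacks them.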


7.2 Tensor Cores Revisited (Without Hype)

Tensor cores:

  • Execute dense, aligned matrix ops

  • Require specific shapes

  • Demand compiler cooperation

If software fails to:

  • Tile correctly

  • Align data

  • Fuse ops

Then tensor cores sit idle.
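
To make the shape requirement concrete, here is a minimal sketch of the alignment check that mixed-precision performance guides describe: GEMM dimensions that are multiples of a small granularity (commonly quoted as 8 for FP16 and 16 for INT8) map cleanly onto the units, while others may fall back to slower paths. The granularities and the GPT-2 vocabulary example are illustrative guidance, not hard rules for every GPU or library version.

    import numpy as np

    # Tensor-core GEMM paths prefer dimensions that are multiples of a small
    # tile granularity. Values below are the commonly quoted guidelines
    # (assumptions, not guarantees for any particular GPU or library).
    GRANULARITY = {np.float16: 8, np.int8: 16}

    def tensor_core_friendly(m, n, k, g=8):
        """True if all three GEMM dimensions are multiples of the granularity."""
        return all(d % g == 0 for d in (m, n, k))

    def pad_to_multiple(x, g):
        """Zero-pad a 2-D operand so both of its dimensions are multiples of g."""
        rows, cols = x.shape
        return np.pad(x, ((0, (-rows) % g), (0, (-cols) % g)))

    # A projection onto GPT-2's vocabulary (50257) has an awkward dimension;
    # padding it to 50264 restores alignment at a tiny memory cost.
    print(tensor_core_friendly(4096, 4096, 1024))    # True
    print(tensor_core_friendly(4096, 4096, 50257))   # False

    logits_w = np.zeros((768, 50257), dtype=np.float16)
    padded = pad_to_multiple(logits_w, GRANULARITY[np.float16])
    print(logits_w.shape, "->", padded.shape)        # (768, 50257) -> (768, 50264)

A handful of padded columns is cheap; a GEMM that falls off the tensor-core path can forfeit most of the expected speedup.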

Specialized hardware amplifies good structure — it does not create it.


7.3 Sparsity: Why It’s Hard

Unstructured sparsity

  • Random zeros

  • Irregular access

  • Poor hardware mapping

Structured sparsity

  • Blocked patterns

  • Predictable skips

  • Hardware-friendly

Most sparsity papers ignore:

  • Index storage and decode overhead

  • Control-flow cost of deciding what to skip

  • Load imbalance across threads

Which is why the promised speedups rarely appear in practice.
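
The index overhead is easy to quantify. Below is a minimal byte-accounting sketch for one FP16 weight matrix stored dense, as unstructured CSR (a 4-byte column index per nonzero plus row pointers), and in a 2:4 structured format that keeps two values out of every four with roughly 2 bits of position metadata each (the scheme described for recent sparse tensor cores). Control cost and load imbalance are not modeled, so this is the optimistic view of unstructured sparsity.

    # Bytes that must move for a 4096 x 4096 FP16 weight matrix at various
    # densities. Format parameters are standard; the accounting is simplified.

    def dense_bytes(rows, cols, elem=2):
        return rows * cols * elem

    def csr_bytes(rows, cols, density, elem=2, idx=4, ptr=4):
        nnz = int(rows * cols * density)
        return nnz * (elem + idx) + (rows + 1) * ptr

    def structured_2of4_bytes(rows, cols, elem=2):
        nnz = rows * cols // 2            # exactly 2 of every 4 values kept
        metadata = nnz * 2 // 8           # ~2 bits of position info per kept value
        return nnz * elem + metadata

    R = C = 4096
    print(f"dense FP16        : {dense_bytes(R, C) / 2**20:6.1f} MiB")
    for density in (0.9, 0.5, 0.3, 0.1):
        print(f"CSR, density {density:.0%}  : {csr_bytes(R, C, density) / 2**20:6.1f} MiB")
    print(f"2:4 structured    : {structured_2of4_bytes(R, C) / 2**20:6.1f} MiB")

With 4-byte indices, the unstructured copy does not even shrink until density drops below about one third, and that is before any control or imbalance cost; the structured format pays off at exactly 50% sparsity because its metadata is tiny and its access pattern stays regular.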


7.4 Energy Is the Real Objective Function

Ultimately:

  • Power limits performance

  • Data movement sets the energy floor

  • FLOPs/W matters more than FLOPs
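
A rough energy tally shows why. The per-operation costs below are assumed order-of-magnitude values in the spirit of the ISSCC-style surveys cited at the end of this chapter; they shift with process node, but the ratios are stable enough to carry the argument.

    # Energy accounting for one FP16 layer of width D = 8192.
    # Per-op energies in picojoules are assumptions; only the ratios matter.
    E_FMA  = 1.0      # pJ per FP16 fused multiply-add          (assumed)
    E_SRAM = 6.0      # pJ per byte read from on-chip SRAM      (assumed)
    E_DRAM = 200.0    # pJ per byte read from off-chip DRAM     (assumed)

    def energy_mj(fmas, dram_bytes, sram_bytes=0):
        pj = fmas * E_FMA + dram_bytes * E_DRAM + sram_bytes * E_SRAM
        return pj * 1e-9                  # picojoules -> millijoules

    D = 8192

    # Batch-1 inference (a GEMV): every weight is fetched from DRAM for a
    # single multiply-add, so arithmetic is a rounding error.
    fmas, dram = D * D, 2 * D * D
    share = fmas * E_FMA / (fmas * E_FMA + dram * E_DRAM)
    print(f"GEMV, batch 1  : {energy_mj(fmas, dram):6.1f} mJ  (compute share {share:.1%})")

    # Batch-512 GEMM: the same weights amortized over 512 inputs, with tiles
    # re-read ~8x from SRAM (an assumed reuse factor). Only this kind of reuse
    # lets arithmetic approach half of the energy bill.
    B = 512
    fmas = B * D * D
    dram = 2 * (D * D + 2 * B * D)
    sram = 8 * dram
    total = energy_mj(fmas, dram, sram)
    print(f"GEMM, batch 512: {total:6.1f} mJ  (compute share {fmas * E_FMA * 1e-9 / total:.1%})")

At batch 1, arithmetic costs a few hundred times less than the DRAM traffic that feeds it; only heavy on-chip reuse changes the balance.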

This is why:

  • AI hardware trends toward specialization

  • On-chip memory grows

  • Generality shrinks


Chapter 7 Takeaway

An optimization only matters if it reduces data movement or communication — otherwise it is cosmetic.


Deep References 🔍

🟢 Conceptual

  • Sze — energy breakdowns

  • Hennessy & Patterson — power walls

🟡 Architecture

  • NVIDIA mixed-precision docs

  • TPU quantization papers

🔴 Hardware

  • Sparse accelerator designs

  • Energy modeling (ISSCC)

