Friday, December 26, 2025

The Memory Wall: Why Modern ML Is Bandwidth-Bound

FLOPs are cheap; bytes are expensive

Core thesis:
By roughly 2018, ML workloads had stopped being compute-limited. The real bottleneck became moving data (weights, activations, KV caches) across the memory hierarchy.
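
To make "bandwidth-bound" concrete, apply the roofline model: an operation is memory-bound whenever its arithmetic intensity (FLOPs per byte of DRAM traffic) falls below the hardware's ratio of peak compute to peak bandwidth. A minimal sketch, assuming round A100-class numbers (312 TFLOP/s FP16, 2.0 TB/s HBM; both figures are illustrative, not measured):

```python
# Roofline back-of-the-envelope: is an op compute-bound or memory-bound?
# Hardware numbers are illustrative round figures (roughly A100-class).
PEAK_FLOPS = 312e12    # FP16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12       # HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW  # ~156 FLOPs/byte; below this, DRAM is the limit

def attainable(ai_flops_per_byte):
    """Roofline model: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(PEAK_FLOPS, PEAK_BW * ai_flops_per_byte)

# Example: batch-1 matrix-vector product (decode-time matmul), d x d FP16 weights.
d = 4096
flops = 2 * d * d         # one multiply-add per weight
bytes_moved = 2 * d * d   # every 2-byte weight streamed from DRAM once
ai = flops / bytes_moved  # 1.0 FLOP/byte, far below the ridge
print(f"arithmetic intensity: {ai:.1f} FLOPs/byte (ridge: {RIDGE:.0f})")
print(f"attainable: {attainable(ai) / 1e12:.1f} of {PEAK_FLOPS / 1e12:.0f} TFLOP/s")
```

At one FLOP per byte, a decode-time matrix-vector product can reach only about 2 of 312 TFLOP/s; the other 99% of the compute sits idle, waiting on HBM.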

Hardware angle:

  • HBM vs SRAM energy gap (first sketch below)

  • Why attention has far lower arithmetic intensity than convolutions (second sketch below)

  • Why operator fusion matters more than new layers (third sketch below)
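
On the energy gap: the numbers usually cited trace back to Horowitz's ISSCC 2014 keynote, where a 32-bit off-chip DRAM access costs on the order of 640 pJ versus a few pJ for a small on-chip SRAM read and under 5 pJ for an FP32 multiply-add. A minimal sketch using those ballpark 45 nm figures (order-of-magnitude estimates that shift with process node):

```python
# Energy for one d x d GEMV: arithmetic energy vs. the energy of the
# memory traffic that feeds it. Per-op costs are ballpark 45 nm figures
# from Horowitz (ISSCC 2014); absolutes shift by node, ratios less so.
E_FMA = 4.6e-12    # J per FP32 multiply-add (3.7 pJ mul + 0.9 pJ add)
E_SRAM = 5e-12     # J per 32-bit read from a small on-chip SRAM
E_DRAM = 640e-12   # J per 32-bit read from off-chip DRAM

d = 4096
macs = d * d   # multiply-accumulates in the GEMV
words = d * d  # one 32-bit weight word fetched per MAC

compute_J = macs * E_FMA
dram_J = words * E_DRAM
sram_J = words * E_SRAM
print(f"compute:           {compute_J * 1e6:7.0f} uJ")
print(f"weights from DRAM: {dram_J * 1e6:7.0f} uJ ({dram_J / compute_J:.0f}x compute)")
print(f"weights from SRAM: {sram_J * 1e6:7.0f} uJ")
```

Fetching a weight from DRAM costs roughly two orders of magnitude more energy than computing with it, which is why keeping data resident in on-chip SRAM dominates accelerator design.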
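
On attention versus convolutions: a 3x3 conv reuses each weight across every spatial position, so its arithmetic intensity is high, while one decode step of self-attention reads the entire KV cache and performs only a couple of FLOPs per cached value. A sketch comparing the two (layer shapes are arbitrary illustrative picks):

```python
# Arithmetic intensity (FLOPs per DRAM byte) of a conv layer versus one
# decode step of self-attention. Shapes here are arbitrary examples.

def conv2d_ai(h, w, c_in, c_out, k, dtype_bytes=2):
    flops = 2 * h * w * c_in * c_out * k * k       # MACs, counted as 2 FLOPs
    traffic = dtype_bytes * (c_in * c_out * k * k  # weights: read once,
              + h * w * c_in + h * w * c_out)      # reused h*w times
    return flops / traffic

def decode_attn_ai(seq_len, d_head, dtype_bytes=2):
    # One new query attends over the whole cache: scores = q @ K^T, then
    # out = scores @ V. Every cached K and V byte is read exactly once.
    flops = 4 * seq_len * d_head                   # QK^T plus attn @ V
    traffic = dtype_bytes * 2 * seq_len * d_head   # K cache + V cache
    return flops / traffic

print(f"3x3 conv, 56x56x256:      {conv2d_ai(56, 56, 256, 256, 3):5.0f} FLOPs/byte")
print(f"attention decode, 8K ctx: {decode_attn_ai(8192, 128):5.1f} FLOPs/byte")
```

The conv sits far above the ~156 FLOPs/byte ridge; decode attention sits at about 1 FLOP per byte no matter how long the context grows, so long-context decoding is a pure bandwidth problem.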
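
On fusion: a chain of unfused elementwise kernels round-trips every intermediate tensor through DRAM, while a fused kernel reads the input once and writes the output once, keeping intermediates in registers. A sketch that just counts the traffic (the three-op chain and the sizes are hypothetical):

```python
# DRAM traffic for y = dropout(gelu(x + b)) over n FP16 elements,
# unfused (three kernels, two materialized intermediates) versus fused.
n, B = 1 << 28, 2  # 268M elements, 2 bytes each (FP16)

# Unfused: each of the 3 kernels streams one n-element tensor in from DRAM
# and one back out (the broadcast bias b is negligible and ignored here).
unfused = B * n * (2 + 2 + 2)

# Fused: one kernel reads x and writes y; the gelu/dropout intermediates
# stay in registers or SRAM and never touch DRAM.
fused = B * n * 2

print(f"unfused: {unfused / 2**30:.1f} GiB, fused: {fused / 2**30:.1f} GiB "
      f"({unfused / fused:.0f}x less DRAM traffic)")
```

Elementwise chains are entirely bandwidth-bound, so cutting traffic 3x cuts wall-clock time by nearly 3x, which is why compilers chase fusion harder than new layer designs.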

Key insight:

“The fastest accelerator is the one that doesn’t touch DRAM.”
