FLOPs are cheap, bytes are expensive
Core thesis:
By 2018, deep learning workloads on modern accelerators had largely stopped being compute-limited. The real bottleneck became moving data (weights, activations, KV caches) across the memory hierarchy.
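A quick way to see this is a roofline-style back-of-envelope: divide an accelerator's peak FLOP rate by its DRAM bandwidth to get the arithmetic intensity (FLOPs per byte) a kernel needs just to keep the math units busy. The sketch below uses illustrative constants; PEAK_FLOPS and HBM_BANDWIDTH are assumptions, not any particular chip's datasheet.

```python
# Back-of-envelope roofline check: a kernel is memory-bound whenever its
# arithmetic intensity (FLOPs per byte moved) falls below the hardware's
# ratio of peak compute to memory bandwidth. Constants are illustrative.

PEAK_FLOPS = 300e12     # ~300 TFLOP/s of dense matmul throughput (assumed)
HBM_BANDWIDTH = 2e12    # ~2 TB/s of DRAM bandwidth (assumed)

ridge_point = PEAK_FLOPS / HBM_BANDWIDTH  # FLOPs/byte needed to stay compute-bound
print(f"ridge point: {ridge_point:.0f} FLOPs per byte")

def attainable_flops(intensity_flops_per_byte: float) -> float:
    """Roofline model: performance is capped by either compute or bandwidth."""
    return min(PEAK_FLOPS, HBM_BANDWIDTH * intensity_flops_per_byte)

# A memory-bound elementwise op (~0.25 FLOPs per byte in fp32) versus a
# large matmul that reuses each byte hundreds of times.
for name, intensity in [("elementwise add", 0.25), ("large matmul", 400.0)]:
    frac = attainable_flops(intensity) / PEAK_FLOPS
    print(f"{name:>15}: {frac:5.1%} of peak compute attainable")
```

Most elementwise and normalization ops sit far below the ridge point, which is why they run at a tiny fraction of peak no matter how many FLOPs the chip advertises.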
Hardware angle:
- The HBM vs SRAM energy gap (rough energy accounting sketched below)
- Why attention is worse than convolutions (see the arithmetic-intensity comparison after this list)
- Why operator fusion matters more than new layers (see the DRAM-traffic sketch below)
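On the energy gap: the point is that fetching operands costs far more than the arithmetic itself once they come from off-chip memory. The per-access energies below are placeholder assumptions chosen only to show the accounting; substitute real datasheet or measured numbers for an actual chip.

```python
# Order-of-magnitude energy accounting for one fused-multiply-add worth of
# work when operands come from on-chip SRAM versus off-chip HBM/DRAM.
# All pJ figures are rough illustrative assumptions, not measurements.

ENERGY_PJ = {
    "fp16 FMA":        1.0,   # the arithmetic itself (assumed)
    "SRAM read/byte":  0.5,   # on-chip scratchpad or cache (assumed)
    "DRAM read/byte": 50.0,   # off-chip HBM (assumed)
}

BYTES_PER_FMA = 6  # two fp16 operands in, one fp16 result out (assumed)

for src in ("SRAM read/byte", "DRAM read/byte"):
    total = ENERGY_PJ["fp16 FMA"] + BYTES_PER_FMA * ENERGY_PJ[src]
    ratio = BYTES_PER_FMA * ENERGY_PJ[src] / ENERGY_PJ["fp16 FMA"]
    print(f"{src:>15}: {total:6.1f} pJ per FMA ({ratio:.0f}x the math itself)")
```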
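On attention versus convolutions: at decode time, attention streams the entire KV cache out of memory to do only a couple of FLOPs per element read, while a convolution reuses each weight across every output position. A minimal arithmetic-intensity sketch, with assumed shapes and fp16 operands:

```python
# Arithmetic intensity of single-token (decode) attention vs. a convolution.
# Shapes and dtype sizes are illustrative assumptions.

BYTES_PER_ELEM = 2  # fp16

def decode_attention_intensity(seq_len: int, d_head: int, n_heads: int) -> float:
    """One new query token attending over a cached sequence (per layer)."""
    kv_elems = 2 * seq_len * d_head * n_heads    # K and V caches, each read once
    flops = 2 * 2 * seq_len * d_head * n_heads   # QK^T plus attn @ V (mul + add)
    return flops / (kv_elems * BYTES_PER_ELEM)

def conv_intensity(h: int, w: int, c_in: int, c_out: int, k: int) -> float:
    """k x k conv: weights are reused across every output position."""
    weight_bytes = k * k * c_in * c_out * BYTES_PER_ELEM
    act_bytes = (h * w * c_in + h * w * c_out) * BYTES_PER_ELEM
    flops = 2 * h * w * c_in * c_out * k * k
    return flops / (weight_bytes + act_bytes)

print(f"decode attention: {decode_attention_intensity(4096, 128, 32):.1f} FLOPs/byte")
print(f"3x3 convolution : {conv_intensity(56, 56, 256, 256, 3):.1f} FLOPs/byte")
```

With these example shapes the attention step lands around one FLOP per byte, hundreds of times below the convolution, so it is pinned to memory bandwidth rather than compute.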
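On fusion: every unfused elementwise op in a chain round-trips a full activation tensor through DRAM, while a fused kernel reads the input once and writes the result once. A traffic-counting sketch (tensor size and chain length are hypothetical), not a kernel implementation:

```python
# DRAM traffic for a chain of elementwise ops, unfused vs. fused.
# N_ELEMS is an assumed activation-tensor size.

N_ELEMS = 64 * 1024 * 1024  # one activation tensor (assumed)
BYTES = 2                   # fp16

def unfused_traffic(num_ops: int) -> int:
    # each op reads one tensor from DRAM and writes one tensor back
    return num_ops * 2 * N_ELEMS * BYTES

def fused_traffic() -> int:
    # one read of the input, one write of the final output
    return 2 * N_ELEMS * BYTES

chain = 4  # e.g. bias add -> activation -> dropout mask -> residual add
print(f"unfused: {unfused_traffic(chain) / 1e9:.2f} GB of DRAM traffic")
print(f"fused  : {fused_traffic() / 1e9:.2f} GB of DRAM traffic")
```

The saved traffic scales with the length of the fused chain, which is why compilers chasing fusion opportunities often buy more speed than swapping in a new layer type.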
Key insight:
“The fastest accelerator is the one that doesn’t touch DRAM.”