Friday, December 26, 2025

How hardware ↔ ML co-evolved (the big picture)

Why hardware engineers should learn ML

The map of mutual influence: hardware shaping ML algorithms, and ML pushing hardware changes back.

A. The feedback loop (high level)

  1. Hardware constraint appears
    → limited memory, bandwidth, or parallelism

  2. ML adapts algorithmically
    → new architectures, training tricks, inference strategies

  3. Those algorithms become dominant workloads
    → hardware vendors optimize for them

  4. New hardware capabilities appear
    → ML researchers exploit them, creating new algorithms

This loop repeats every ~2–4 years.


B. Concrete historical shifts (chronological list)

1. CPUs → GPUs (2008–2014)

Hardware reality

  • CPUs: low parallelism, strong control flow

  • GPUs: massive SIMD throughput, weak branching

ML adaptation

  • Shift from symbolic ML → dense linear algebra

  • CNNs, matrix multiplies, batch processing
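
As a concrete picture of what "dense linear algebra" means here, below is a minimal sketch in PyTorch (my choice of library, not something the post specifies): a batch of examples through one fully connected layer is a single large matrix multiply, exactly the shape of work a GPU's parallel units are built for. The sizes are illustrative.

```python
import torch

# Illustrative sizes: a batch of 256 feature vectors through one dense layer.
batch, d_in, d_out = 256, 1024, 4096

x = torch.randn(batch, d_in)   # activations
W = torch.randn(d_out, d_in)   # layer weights
b = torch.randn(d_out)         # bias

# The whole batch is one GEMM: (256 x 1024) @ (1024 x 4096) -> (256 x 4096).
# Expressed per-example this would waste the GPU's parallelism; expressed as
# a batched matrix multiply it maps straight onto its multiply-accumulate units.
y = x @ W.T + b

# The same expression runs unchanged on a GPU if one is present.
if torch.cuda.is_available():
    y = x.cuda() @ W.cuda().T + b.cuda()
```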

Hardware response

  • CUDA, cuDNN

  • Tensor cores later added specifically for GEMMs


2. GPU memory bandwidth wall (2016–2019)

Hardware reality

  • Peak compute (FLOP/s) grew faster than HBM bandwidth

  • Models became memory-bound, not compute-bound

ML adaptation

  • Layer normalization

  • Fused ops

  • Activation checkpointing (sketched below)

  • Attention replaces RNNs (better parallelism)
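
Activation checkpointing, listed above, trades compute for memory: intermediate activations are dropped during the forward pass and recomputed during backward. A minimal sketch using PyTorch's checkpoint utility; the block and sizes are illustrative, not from the post.

```python
import torch
from torch.utils.checkpoint import checkpoint

# An illustrative block whose intermediate activations are expensive to keep.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
x = torch.randn(8, 1024, requires_grad=True)

# Plain forward: every intermediate activation is held until backward.
loss_plain = block(x).sum()

# Checkpointed forward: intermediates are discarded, then recomputed during
# backward, trading one extra forward through `block` for activation memory.
# (use_reentrant=False is the recommended mode in recent PyTorch versions.)
out_ckpt = checkpoint(block, x, use_reentrant=False)
out_ckpt.sum().backward()
```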

Hardware response

  • Larger on-chip SRAM

  • Wider memory buses

  • Operator fusion in compilers (XLA, TVM)
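
Operator fusion is easiest to see from the user side: a chain of elementwise ops that would otherwise each make a round trip through memory can be compiled into one kernel. Below is a minimal sketch using torch.compile (assuming a recent PyTorch) as a stand-in for the compiler stacks named above; the function itself is made up for illustration.

```python
import torch

def bias_gelu_scale(x, bias):
    # Three elementwise steps; run eagerly, each one reads and writes
    # the whole tensor to and from device memory.
    return torch.nn.functional.gelu(x + bias) * 0.5

# torch.compile hands the traced graph to a fusing backend (TorchInductor by
# default), which can emit a single kernel for the whole elementwise chain,
# so the intermediates never take a round trip through DRAM.
fused = torch.compile(bias_gelu_scale)

x = torch.randn(4096, 4096)
bias = torch.randn(4096)
out = fused(x, bias)
```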


3. Transformers dominate → attention bottleneck (2019–2022)

Hardware reality

  • Attention is quadratic

  • KV cache explodes with sequence length

  • Inference limited by memory movement, not math

ML adaptation

  • Sparse attention

  • FlashAttention

  • KV caching (sketched below)

  • Rotary embeddings (better reuse)
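
KV caching, listed above, is why decoding is dominated by memory movement: every previous token's keys and values are stored and reread at each step rather than recomputed. A minimal single-head sketch in PyTorch, with names and sizes that are purely illustrative:

```python
import torch
import torch.nn.functional as F

d = 64                        # head dimension (illustrative)
k_cache = torch.empty(0, d)   # keys of every token generated so far
v_cache = torch.empty(0, d)   # values of every token generated so far

def decode_step(q, k_new, v_new):
    """One autoregressive step: append this token's K/V, attend over the cache."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new[None, :]])   # grows one row per token
    v_cache = torch.cat([v_cache, v_new[None, :]])
    scores = k_cache @ q / d ** 0.5                  # (seq_len,)
    weights = F.softmax(scores, dim=0)
    return weights @ v_cache                         # (d,)

# Random projections stand in for a real model's Q/K/V heads.
for _ in range(5):
    q, k, v = torch.randn(d), torch.randn(d), torch.randn(d)
    out = decode_step(q, k, v)

# The cache now holds one K and one V row per generated token, per head, per
# layer: the state that "explodes with sequence length" during inference.
```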

Hardware response

  • Tensor cores optimized for smaller datatypes

  • SRAM tiling for attention

  • Hopper adds Transformer Engine


4. Inference becomes the cost center (2022–2024)

Hardware reality

  • Training is episodic, inference is continuous

  • Serving costs dominate energy use and capex

ML adaptation

  • Quantization (INT8, FP8; sketched below)

  • Distillation

  • Speculative decoding

  • Post-training alignment (instead of retraining)
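
Quantization cuts both the memory footprint and the bandwidth spent per weight. A minimal sketch of symmetric per-tensor INT8 weight quantization, just the arithmetic, not any particular production scheme (and not the FP8 formats mentioned above):

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 weights -> int8 plus one scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)          # an illustrative weight matrix
q, scale = quantize_int8(w)

# 4 bytes/weight -> 1 byte/weight: a 4x cut in the memory traffic spent
# streaming weights, which is what dominates decode-time inference.
error = (dequantize(q, scale) - w).abs().mean()
print(f"mean absolute quantization error: {error.item():.5f}")
```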

Hardware response

  • FP8 native support

  • INT4/INT8 accelerators

  • KV cache residency optimizations


5. Long context breaks hardware assumptions (2023–2025)

Hardware reality

  • You cannot fit million-token attention in memory

  • DRAM access dominates energy cost

ML adaptation

  • Chunked processing (sketched below)

  • Memory-augmented inference

  • Post-training long-context alignment (QwenLong-L1.5)

  • Recurrence returns (in hardware-friendly form)
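
Chunked processing and hardware-friendly recurrence share one shape: the input is streamed through in fixed-size chunks while a bounded state is carried forward, so memory stays flat however long the context gets. A toy sketch of that shape, with nothing specific to QwenLong-L1.5 or any real system:

```python
import torch

chunk_len, d, n_chunks = 1024, 256, 1000   # illustrative sizes

def next_chunk():
    # Stand-in for streaming the next slice of a very long input from storage.
    return torch.randn(chunk_len, d)

def process_chunk(chunk, state):
    """Fold one chunk into a fixed-size carried state."""
    summary = chunk.mean(dim=0)            # toy stand-in for a real chunk encoder
    return 0.9 * state + 0.1 * summary     # state size never depends on context length

state = torch.zeros(d)
for _ in range(n_chunks):
    state = process_chunk(next_chunk(), state)

# Peak memory is one chunk plus the carried state, never the quadratic
# attention matrix over the full (million-token) context.
```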

Hardware implication (still unfolding)

  • Streaming-first accelerators

  • Larger on-chip memory

  • Better support for stateful inference

This is where QwenLong-L1.5 sits.
