Friday, December 26, 2025

How hardware ↔ ML co-evolved (the big picture)

Why hardware engineers should learn ML

The map of mutual influence: hardware shaping ML algorithms, and ML pushing hardware changes back.

A. The feedback loop (high level)

  1. Hardware constraint appears
    → limited memory, bandwidth, or parallelism

  2. ML adapts algorithmically
    → new architectures, training tricks, inference strategies

  3. Those algorithms become dominant workloads
    → hardware vendors optimize for them

  4. New hardware capabilities appear
    → ML researchers exploit them, creating new algorithms

This loop repeats every ~2–4 years.


B. Concrete historical shifts (chronological list)

1. CPUs → GPUs (2008–2014)

Hardware reality

  • CPUs: low parallelism, strong control flow

  • GPUs: massive SIMD throughput, weak branching

ML adaptation

  • Shift from symbolic ML → dense linear algebra

  • CNNs, matrix multiplies, batch processing
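
As a concrete picture of what "dense linear algebra" means here, below is a minimal sketch in PyTorch (my choice of library, not something the post specifies): a batch of examples through one fully connected layer is a single large matrix multiply, exactly the shape of work a GPU's parallel units are built for. The sizes are illustrative.

```python
import torch

# Illustrative sizes: a batch of 256 feature vectors through one dense layer.
batch, d_in, d_out = 256, 1024, 4096

x = torch.randn(batch, d_in)   # activations
W = torch.randn(d_out, d_in)   # layer weights
b = torch.randn(d_out)         # bias

# The whole batch is one GEMM: (256 x 1024) @ (1024 x 4096) -> (256 x 4096).
# Expressed per-example this would waste the GPU's parallelism; expressed as
# a batched matrix multiply it maps straight onto its multiply-accumulate units.
y = x @ W.T + b

# The same expression runs unchanged on a GPU if one is present.
if torch.cuda.is_available():
    y = x.cuda() @ W.cuda().T + b.cuda()
```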

Hardware response

  • CUDA, cuDNN

  • Tensor cores later added specifically for GEMMs


2. GPU memory bandwidth wall (2016–2019)

Hardware reality

  • Peak compute (FLOP/s) grew faster than HBM bandwidth

  • Models became memory-bound, not compute-bound

ML adaptation

  • Layer normalization

  • Fused ops

  • Activation checkpointing (sketched below)

  • Attention replaces RNNs (better parallelism)
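
Activation checkpointing, listed above, trades compute for memory: intermediate activations are dropped during the forward pass and recomputed during backward. A minimal sketch using PyTorch's checkpoint utility; the block and sizes are illustrative, not from the post.

```python
import torch
from torch.utils.checkpoint import checkpoint

# An illustrative block whose intermediate activations are expensive to keep.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
x = torch.randn(8, 1024, requires_grad=True)

# Plain forward: every intermediate activation is held until backward.
loss_plain = block(x).sum()

# Checkpointed forward: intermediates are discarded, then recomputed during
# backward, trading one extra forward through `block` for activation memory.
# (use_reentrant=False is the recommended mode in recent PyTorch versions.)
out_ckpt = checkpoint(block, x, use_reentrant=False)
out_ckpt.sum().backward()
```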

Hardware response

  • Larger on-chip SRAM

  • Wider memory buses

  • Operator fusion in compilers (XLA, TVM)
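
Operator fusion is easiest to see from the user side: a chain of elementwise ops that would otherwise each make a round trip through memory can be compiled into one kernel. Below is a minimal sketch using torch.compile (assuming a recent PyTorch) as a stand-in for the compiler stacks named above; the function itself is made up for illustration.

```python
import torch

def bias_gelu_scale(x, bias):
    # Three elementwise steps; run eagerly, each one reads and writes
    # the whole tensor to and from device memory.
    return torch.nn.functional.gelu(x + bias) * 0.5

# torch.compile hands the traced graph to a fusing backend (TorchInductor by
# default), which can emit a single kernel for the whole elementwise chain,
# so the intermediates never take a round trip through DRAM.
fused = torch.compile(bias_gelu_scale)

x = torch.randn(4096, 4096)
bias = torch.randn(4096)
out = fused(x, bias)
```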


3. Transformers dominate → attention bottleneck (2019–2022)

Hardware reality

  • Attention is quadratic

  • KV cache explodes with sequence length

  • Inference limited by memory movement, not math

ML adaptation

  • Sparse attention

  • FlashAttention

  • KV caching (sketched below)

  • Rotary embeddings (better reuse)
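
KV caching, listed above, is why decoding is dominated by memory movement: every previous token's keys and values are stored and reread at each step rather than recomputed. A minimal single-head sketch in PyTorch, with names and sizes that are purely illustrative:

```python
import torch
import torch.nn.functional as F

d = 64                        # head dimension (illustrative)
k_cache = torch.empty(0, d)   # keys of every token generated so far
v_cache = torch.empty(0, d)   # values of every token generated so far

def decode_step(q, k_new, v_new):
    """One autoregressive step: append this token's K/V, attend over the cache."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new[None, :]])   # grows one row per token
    v_cache = torch.cat([v_cache, v_new[None, :]])
    scores = k_cache @ q / d ** 0.5                  # (seq_len,)
    weights = F.softmax(scores, dim=0)
    return weights @ v_cache                         # (d,)

# Random projections stand in for a real model's Q/K/V heads.
for _ in range(5):
    q, k, v = torch.randn(d), torch.randn(d), torch.randn(d)
    out = decode_step(q, k, v)

# The cache now holds one K and one V row per generated token, per head, per
# layer: the state that "explodes with sequence length" during inference.
```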

Hardware response

  • Tensor cores optimized for smaller datatypes

  • SRAM tiling for attention

  • Hopper adds Transformer Engine


4. Inference becomes the cost center (2022–2024)

Hardware reality

  • Training is episodic, inference is continuous

  • Serving costs dominate energy use and capex

ML adaptation

  • Quantization (INT8, FP8; sketched below)

  • Distillation

  • Speculative decoding

  • Post-training alignment (instead of retraining)
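
Quantization cuts both the memory footprint and the bandwidth spent per weight. A minimal sketch of symmetric per-tensor INT8 weight quantization, just the arithmetic, not any particular production scheme (and not the FP8 formats mentioned above):

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 weights -> int8 plus one scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)          # an illustrative weight matrix
q, scale = quantize_int8(w)

# 4 bytes/weight -> 1 byte/weight: a 4x cut in the memory traffic spent
# streaming weights, which is what dominates decode-time inference.
error = (dequantize(q, scale) - w).abs().mean()
print(f"mean absolute quantization error: {error.item():.5f}")
```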

Hardware response

  • FP8 native support

  • INT4/INT8 accelerators

  • KV cache residency optimizations


5. Long context breaks hardware assumptions (2023–2025)

Hardware reality

  • You cannot fit million-token attention in memory

  • DRAM access dominates energy cost

ML adaptation

  • Chunked processing (sketched below)

  • Memory-augmented inference

  • Post-training long-context alignment (QwenLong-L1.5)

  • Recurrence returns (in hardware-friendly form)
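
Chunked processing and hardware-friendly recurrence share one shape: the input is streamed through in fixed-size chunks while a bounded state is carried forward, so memory stays flat however long the context gets. A toy sketch of that shape, with nothing specific to QwenLong-L1.5 or any real system:

```python
import torch

chunk_len, d, n_chunks = 1024, 256, 1000   # illustrative sizes

def next_chunk():
    # Stand-in for streaming the next slice of a very long input from storage.
    return torch.randn(chunk_len, d)

def process_chunk(chunk, state):
    """Fold one chunk into a fixed-size carried state."""
    summary = chunk.mean(dim=0)            # toy stand-in for a real chunk encoder
    return 0.9 * state + 0.1 * summary     # state size never depends on context length

state = torch.zeros(d)
for _ in range(n_chunks):
    state = process_chunk(next_chunk(), state)

# Peak memory is one chunk plus the carried state, never the quadratic
# attention matrix over the full (million-token) context.
```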

Hardware implication (still unfolding)

  • Streaming-first accelerators

  • Larger on-chip memory

  • Better support for stateful inference

This is where QwenLong-L1.5 sits.
