Why hardware engineers should learn ML
A map of the mutual influence: hardware shaping ML algorithms, and ML pushing hardware to change in return.
A. The feedback loop (high level)
- Hardware constraint appears → limited memory, bandwidth, or parallelism
- ML adapts algorithmically → new architectures, training tricks, inference strategies
- Those algorithms become dominant workloads → hardware vendors optimize for them
- New hardware capabilities appear → ML researchers exploit them, creating new algorithms
This loop repeats every ~2–4 years.
B. Concrete historical shifts (chronological list)
1. CPUs → GPUs (2008–2014)
Hardware reality
- CPUs: low parallelism, strong control flow
- GPUs: massive SIMD throughput, weak branching
ML adaptation
- Shift from symbolic ML → dense linear algebra
- CNNs, matrix multiplies, batch processing
Hardware response
- CUDA, cuDNN
- Tensor cores later added specifically for GEMMs
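To make the "dense linear algebra" point concrete, here is a minimal sketch (PyTorch assumed; the post does not name a framework): the workload GPU vendors ended up optimizing for is essentially a large batched GEMM, the core op behind CNNs and transformer layers alike.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# A batch of 16 independent (1024 x 1024) @ (1024 x 1024) multiplies --
# the shape of work that CNN and transformer layers reduce to.
a = torch.randn(16, 1024, 1024, device=device, dtype=dtype)
b = torch.randn(16, 1024, 1024, device=device, dtype=dtype)

# torch.matmul on 3-D tensors is a batched GEMM; in FP16 on recent NVIDIA
# GPUs the cuBLAS backend routes it to tensor cores.
c = torch.matmul(a, b)
print(c.shape)  # torch.Size([16, 1024, 1024])
```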
2. GPU memory bandwidth wall (2016–2019)
Hardware reality
- FLOPs grew faster than HBM bandwidth
- Models became memory-bound, not compute-bound
ML adaptation
- Layer normalization
- Fused ops
- Activation checkpointing
- Attention replaces RNNs (better parallelism)
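Activation checkpointing from the list above is easy to show in code. A minimal sketch, assuming PyTorch's `torch.utils.checkpoint` API: discard intermediate activations in the forward pass and recompute them during backward, trading FLOPs (which the hardware has in surplus) for memory (which it does not).

```python
import torch
from torch.utils.checkpoint import checkpoint

# A block whose intermediate activations we do not want to keep around.
block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)

x = torch.randn(8, 4096, requires_grad=True)

# Activations inside `block` are dropped after forward and recomputed during
# backward -- extra compute in exchange for lower activation memory, a good
# trade when the model is memory-bound rather than compute-bound.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```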
Hardware response
- Larger on-chip SRAM
- Wider memory buses
- Operator fusion in compilers (XLA, TVM)
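Operator fusion is the compiler-side counterpart. The sketch below uses `torch.compile` as a stand-in for the XLA/TVM-style compilers mentioned above (my choice of tool, not the post's): a chain of elementwise ops gets fused into one kernel so intermediates never round-trip through HBM.

```python
import torch

def gelu_bias_residual(x, bias, residual):
    # Unfused, each op reads and writes a full tensor from/to device memory.
    return torch.nn.functional.gelu(x + bias) + residual

# The compiler can fuse the elementwise chain into a single kernel, keeping
# intermediates in registers/SRAM instead of spilling them to HBM.
fused = torch.compile(gelu_bias_residual)

x = torch.randn(1024, 4096)
bias = torch.randn(4096)
residual = torch.randn(1024, 4096)
out = fused(x, bias, residual)
```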
3. Transformers dominate → attention bottleneck (2019–2022)
Hardware reality
- Attention is quadratic in sequence length
- KV cache explodes with sequence length
- Inference limited by memory movement, not math
ML adaptation
- Sparse attention
- FlashAttention
- KV caching
- Rotary embeddings (better reuse)
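A rough sketch of how KV caching and a fused attention kernel fit together in a decode loop, assuming PyTorch (its `scaled_dot_product_attention` can dispatch to a FlashAttention-style kernel on supported GPUs). The shapes and loop are illustrative, not any particular model's implementation.

```python
import torch
import torch.nn.functional as F

B, H, D = 1, 8, 64                   # batch, heads, head dim
k_cache = torch.zeros(B, H, 0, D)    # grows along the sequence axis
v_cache = torch.zeros(B, H, 0, D)

for step in range(4):                # toy autoregressive decode loop
    q = torch.randn(B, H, 1, D)      # query for the newest token only
    k_new = torch.randn(B, H, 1, D)
    v_new = torch.randn(B, H, 1, D)

    # KV caching: append this step's keys/values instead of recomputing
    # them for every previously generated token.
    k_cache = torch.cat([k_cache, k_new], dim=2)
    v_cache = torch.cat([v_cache, v_new], dim=2)

    # On supported GPUs this call can use a fused, SRAM-tiled kernel that
    # never materializes the full (seq x seq) attention matrix.
    out = F.scaled_dot_product_attention(q, k_cache, v_cache)
    # out: (B, H, 1, D) attention output for the newest token
```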
Hardware response
- Tensor cores optimized for smaller datatypes
- SRAM tiling for attention
- Hopper adds Transformer Engine
4. Inference becomes the cost center (2022–2024)
Hardware reality
- Training is episodic, inference is continuous
- Serving costs dominate energy + capex
ML adaptation
- Quantization (INT8, FP8)
- Distillation
- Speculative decoding
- Post-training alignment (instead of retraining)
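Quantization is the adaptation that maps most directly onto the hardware response that follows. A hand-rolled sketch of symmetric per-tensor INT8 weight quantization (not any specific library's API), to show why it cuts weight memory and bandwidth roughly 4x versus FP32:

```python
import torch

w = torch.randn(4096, 4096)          # FP32 weight matrix

# Map the observed weight range onto the int8 range [-127, 127].
scale = w.abs().max() / 127.0
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)

# At inference time the weights are either dequantized on the fly or fed
# directly to INT8 matmul units; activations can be handled the same way.
w_deq = w_int8.to(torch.float32) * scale
print((w - w_deq).abs().max())       # rounding error is bounded by ~scale/2
```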
Hardware response
- FP8 native support
- INT4/INT8 accelerators
- KV cache residency optimizations
5. Long context breaks hardware assumptions (2023–2025)
Hardware reality
- You cannot fit million-token attention in memory
- DRAM access dominates energy cost
ML adaptation
- Chunked processing
- Memory-augmented inference
- Post-training long-context alignment (QwenLong-L1.5)
- Recurrence returns (but hardware-friendly)
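A toy sketch of chunked processing with a carried state, to show why it sidesteps the memory wall. This is illustrative only, not the QwenLong-L1.5 recipe, and `process_chunk` is a hypothetical placeholder for whatever per-chunk update a real model uses.

```python
import torch

CHUNK = 2048
D = 256

def process_chunk(chunk, state):
    # Hypothetical update: fold a summary of the chunk into the carried state.
    summary = chunk.mean(dim=0)
    return 0.9 * state + 0.1 * summary

tokens = torch.randn(100_000, D)     # stand-in for a very long input sequence
state = torch.zeros(D)

# Peak memory depends on CHUNK, not on the full context length: each chunk is
# processed, summarized into the state, and then freed.
for start in range(0, tokens.shape[0], CHUNK):
    state = process_chunk(tokens[start:start + CHUNK], state)

# `state` is the fixed-size summary the model conditions on downstream.
```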
Hardware implication (still unfolding)
- Streaming-first accelerators
- Larger on-chip memory
- Better support for stateful inference
This is where QwenLong-L1.5 sits.