Friday, December 26, 2025

Inference Is the New Training

Why post-training matters more than pre-training now

Core thesis:
Once models became huge, training stopped being the bottleneck. Inference — latency, throughput, power — became the real systems problem.

Hardware angle:

  • Continuous serving vs episodic training

  • Why quantization is a hardware feature, not an ML trick

  • Why speculative decoding is a scheduling optimization

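To make the quantization point concrete: the win is mostly about memory bandwidth, not arithmetic. A minimal sketch of symmetric int8 quantization (illustrative only, not any specific library's API):

```python
# Symmetric int8 quantization, minimal sketch.
# The hardware point: int8 weights take 1 byte instead of 4 for float32,
# so a matmul reads 4x fewer bytes from memory. The speedup comes from
# bandwidth, which is why quantization is a hardware feature.

def quantize_int8(weights):
    """Map floats to int8 codes with one symmetric scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]

weights = [0.8, -1.2, 0.05, 0.31]
codes, scale = quantize_int8(weights)
recovered = dequantize(codes, scale)
# Worst-case rounding error is about scale / 2 per weight,
# while memory traffic for the weights drops 4x.
```

The rounding error is bounded by half the scale step, which is why accuracy degrades gracefully while the bandwidth savings are fixed and guaranteed.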
Key insight:

“A model that’s 5% worse but 2× cheaper wins in production.”
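The quote above is easy to sanity-check with back-of-envelope arithmetic (hypothetical numbers; quality on an arbitrary 0-100 scale, cost in dollars per million tokens):

```python
# Hypothetical production trade-off: model B scores 5% lower on quality
# but costs half as much to serve.

model_a = {"quality": 100.0, "cost_per_1m_tokens": 2.00}
model_b = {"quality": 95.0, "cost_per_1m_tokens": 1.00}  # 5% worse, 2x cheaper

value_a = model_a["quality"] / model_a["cost_per_1m_tokens"]  # 50 points per $
value_b = model_b["quality"] / model_b["cost_per_1m_tokens"]  # 95 points per $

# At a fixed serving budget, B delivers ~1.9x the quality per dollar,
# i.e. it can serve nearly twice the traffic at near-parity quality.
print(value_b / value_a)
```

The asymmetry is the point: quality losses are marginal and often below user-perceptible thresholds, while cost savings compound linearly with every token served.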
