Why post-training matters more than pre-training now
Core thesis:
Once models became huge, training stopped being the bottleneck. A training run is a one-time cost; inference is paid again on every request, so latency, throughput, and power became the real systems problem.
Hardware angle:
- Continuous serving vs episodic training: training is a batch job that ends, while serving runs around the clock with requests joining and leaving the batch mid-flight (see the batching sketch after this list)
- Why quantization is a hardware feature, not an ML trick: the win is smaller weights, less memory traffic, and int8 tensor throughput, and only the memory system and ALUs can cash that in (second sketch below)
- Why speculative decoding is a scheduling optimization: a cheap draft model proposes a run of tokens so the expensive model can verify them in one batched pass, trading serial steps for batched work (third sketch below)
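On the first point, here is a minimal sketch of continuous batching, the scheduling pattern that separates serving from training. Request, MAX_BATCH, decode_step, and serve are illustrative names, not any particular engine's API.

```python
# Sketch of continuous batching: requests join and leave the running
# batch at every decode step instead of waiting for a whole batch to finish.
from collections import deque
from dataclasses import dataclass, field

MAX_BATCH = 4  # assumed per-step batch capacity


@dataclass
class Request:
    rid: int
    tokens_left: int                       # decode steps still needed
    output: list = field(default_factory=list)


def decode_step(batch):
    # Stand-in for one forward pass of the model over the whole batch.
    for req in batch:
        req.output.append(f"tok{len(req.output)}")
        req.tokens_left -= 1


def serve(incoming):
    waiting = deque(incoming)
    running = []
    while waiting or running:
        # Unlike an episodic training job, admission happens every step:
        # finished sequences free a slot, new arrivals take it immediately.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        decode_step(running)
        running = [r for r in running if r.tokens_left > 0]


if __name__ == "__main__":
    serve([Request(i, tokens_left=3 + i) for i in range(6)])
```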
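For the second point, a minimal sketch of symmetric per-tensor int8 quantization, assuming NumPy. The arithmetic is trivial; the payoff (4x smaller weights, 4x less memory traffic, int8 tensor paths) is delivered by the hardware, which is the whole argument.

```python
import numpy as np


def quantize_int8(w: np.ndarray):
    # One scale per tensor: map the largest magnitude to 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
print(f"fp32 bytes: {w.nbytes:,}  int8 bytes: {q.nbytes:,}")  # 4x smaller
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```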
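And for the third, a toy version of greedy speculative decoding. draft_next and target_next stand in for the two models and are made up for illustration; in a real engine the verification loop is a single batched forward pass over all proposed positions, which is exactly why this is a scheduling win.

```python
K = 4  # assumed speculation length


def draft_next(ctx):
    # Cheap draft model: a toy rule.
    return (ctx[-1] + 1) % 10


def target_next(ctx):
    # Expensive target model: a toy rule that mostly agrees with the draft.
    return (ctx[-1] + 1) % 10 if len(ctx) % 7 else 0


def speculative_step(ctx):
    # 1. Draft proposes K tokens serially (cheap).
    proposal, c = [], list(ctx)
    for _ in range(K):
        t = draft_next(c)
        proposal.append(t)
        c.append(t)
    # 2. Target checks each proposed position (serially here; a real engine
    #    does all K in one batched pass, paying the big model's latency once).
    accepted, c = [], list(ctx)
    for t in proposal:
        want = target_next(c)
        if t != want:
            accepted.append(want)  # take the target's token and stop
            break
        accepted.append(t)
        c.append(t)
    return ctx + accepted


seq = [1]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```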
Key insight:
“A model that’s 5% worse but 2× cheaper wins in production.”
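A back-of-the-envelope version of that tradeoff, with made-up prices and a made-up scalar quality score, just to show the shape:

```python
# Assumed numbers; "quality" as a single scalar is a deliberate simplification.
big   = {"quality": 1.00, "cost_per_1m_tok": 10.00}
small = {"quality": 0.95, "cost_per_1m_tok": 5.00}  # 5% worse, 2x cheaper

for name, m in [("big", big), ("small", small)]:
    # Cost normalized by quality: what you pay per unit of useful output.
    print(name, round(m["cost_per_1m_tok"] / m["quality"], 2))
# big 10.0, small 5.26: at production volume the 2x price gap
# dwarfs the 5% quality gap for most workloads.
```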