Long context didn’t scale — memory did
Core thesis:
QwenLong-L1.5 points to a quiet shift: instead of forcing hardware to handle ever-larger context windows, the modeling side is adapting again by chunking input, summarizing it, and streaming state forward.
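Concretely, the pattern looks like the loop below. This is a minimal, hypothetical sketch of "chunk, summarize, stream state", not QwenLong-L1.5's actual pipeline; the `summarize` stub and `answer_over_long_context` helper are invented for illustration.

```python
# Toy illustration: process a long document in fixed-size pieces, carrying a
# bounded "memory" forward instead of attending over everything at once.
# The summarizer is a stub (plain truncation); a real system would use the
# model itself to compress the running state.

def summarize(text: str, budget: int = 200) -> str:
    """Placeholder summarizer: keep only the first `budget` characters."""
    return text[:budget]

def answer_over_long_context(document: str, question: str,
                             chunk_size: int = 1000) -> str:
    memory = ""  # bounded state streamed across chunks
    for start in range(0, len(document), chunk_size):
        chunk = document[start:start + chunk_size]
        # Each step only sees the question, the current memory, and one chunk,
        # so per-step context stays O(chunk_size + memory budget),
        # not O(len(document)).
        memory = summarize(memory + "\n" + chunk)
    # A final call would condition the answer on the compressed memory.
    return f"[answer to {question!r} conditioned on {len(memory)} chars of memory]"

if __name__ == "__main__":
    doc = "lorem ipsum " * 10_000  # stand-in for a very long input
    print(answer_over_long_context(doc, "What is this document about?"))
```

The point of the pattern is that no single model call grows with total document length; the cost of "long context" is paid in many bounded steps plus a fixed memory budget.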
Hardware angle:
- Why million-token attention is impossible (back-of-envelope numbers after this list)
- Chunked inference aligns with cache hierarchies
- Post-training enables new execution patterns without new silicon
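To ground the first bullet, here is a quick calculation. The assumptions are mine, not from the post: fp16 attention scores, a typical head dimension of 128, and a naive implementation that materializes the full token-by-token score matrix. Fused kernels such as FlashAttention avoid storing that matrix, but the compute still grows quadratically with sequence length.

```python
# Back-of-envelope cost of dense attention over a million-token context.

tokens = 1_000_000            # a "million-token" context
bytes_per_score = 2           # fp16

score_matrix_bytes = tokens * tokens * bytes_per_score
print(f"materialized score matrix: {score_matrix_bytes / 1e12:.0f} TB per head per layer")
# ~2 TB, far beyond the HBM of any single accelerator.

head_dim = 128                                # assumed per-head dimension
qk_flops = 2 * tokens * tokens * head_dim     # multiply-adds for Q @ K^T alone
print(f"Q @ K^T compute: {qk_flops / 1e12:.0f} TFLOPs per head per layer")
# ~256 TFLOPs per head per layer; multiply by heads and layers for one full pass.
```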
Key insight:
“The future of long-context models is not bigger windows — it’s better memory discipline.”