Saturday, January 03, 2026

Doing a lot at once

Why do we even need SIMD instructions?

Unlike many previous architectural innovations, MMX and its successors all require software updates before the new instructions deliver any benefit, which takes much longer: not only do developers have to write and test new code, but that code must propagate through intermediate layers (such as an operating system release cycle) before finding its way into the hands of end customers.
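For concreteness, a minimal sketch in C with SSE intrinsics (illustrative only, not tied to any particular codebase) of what those instructions buy you: one vector instruction adds four floats at once, where scalar code would need four separate additions.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float out[4];

    __m128 va = _mm_loadu_ps(a);             /* load 4 floats into one 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* 4 additions in a single instruction */

    for (int i = 0; i < 4; i++)
        printf("%.1f ", out[i]);             /* prints: 11.0 22.0 33.0 44.0 */
    putchar('\n');
    return 0;
}
```

And this is exactly the adoption problem described above: the hardware capability exists, but nothing speeds up until compilers, libraries, and applications are rebuilt to emit or call such instructions.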

Integrating the FPU: like plugging a whole platform into the chip

Superscalar processor (Wikipedia)

That detail is easy to miss if you only look at FLOPS charts. One barrier is that "train in FP16, then quantize post-training (PTQ)" is not the same as having native low-precision pathways that save compute and memory at the same time. That's why Nvidia's advantage isn't merely that it "supports FP8," but that it keeps moving the boundary of what is practically trainable and servable at lower precision without asking the ecosystem to rewrite itself each generation.
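To illustrate the gap, here is a minimal sketch of symmetric per-tensor int8 PTQ; the weight values and scaling scheme are illustrative assumptions, not any specific framework's recipe. Storage shrinks 4x, but without native low-precision execution units each value is dequantized back to float before the arithmetic runs, so the compute itself is not saved.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of symmetric post-training quantization (PTQ) to int8.
 * Weights are stored in 1 byte instead of 4, but on hardware without
 * native int8/FP8 matmul paths they are dequantized back to float
 * before use, so the arithmetic still runs at full precision. */
int main(void) {
    float w[4] = {0.02f, -1.31f, 0.77f, -0.44f};  /* illustrative weights */

    /* Per-tensor scale: map the largest magnitude onto the int8 range. */
    float maxabs = 0.0f;
    for (int i = 0; i < 4; i++)
        if (fabsf(w[i]) > maxabs) maxabs = fabsf(w[i]);
    float scale = maxabs / 127.0f;

    for (int i = 0; i < 4; i++) {
        int8_t q = (int8_t)lrintf(w[i] / scale);  /* quantize for storage */
        float back = q * scale;                   /* dequantize for FP compute */
        printf("%+.4f -> %4d -> %+.4f (err %+.4f)\n",
               w[i], q, back, back - w[i]);
    }
    return 0;
}
```

A native low-precision pathway skips the dequantize step entirely: the multiply-accumulate units consume the narrow format directly, which is where the simultaneous compute and memory savings come from.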

"Parts of the Transformer Engine that previously existed as “software over Tensor Cores” are being moved into hardware - this is ASIC-like behavior as it hs hardening a workload pattern into silicon." similar to  Instead, an alternative is to offload such frequent computations to a FPGA, which can perform the same computation in a fraction of the time than software would take

Microarchitecture: what happens beneath

The Raw Power of Intelligence, from Samuel Albanie's 2025 reflections

Fast inverse square root
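For reference, the widely circulated Quake III Arena version of the trick: reinterpret the float's bits as an integer, subtract the shifted bits from the magic constant 0x5f3759df to get a first guess at 1/sqrt(x), then refine with one Newton-Raphson step. This sketch uses a union rather than the original pointer cast to avoid strict-aliasing undefined behavior, and assumes 32-bit IEEE 754 floats.

```c
#include <stdint.h>
#include <stdio.h>

/* Fast inverse square root, after the widely circulated Quake III version. */
float q_rsqrt(float number) {
    union { float f; uint32_t i; } conv = { .f = number };
    conv.i = 0x5f3759df - (conv.i >> 1);                /* bit-level first guess */
    conv.f *= 1.5f - (number * 0.5f * conv.f * conv.f); /* one Newton-Raphson step */
    return conv.f;
}

int main(void) {
    printf("q_rsqrt(4.0) = %f (exact: 0.5)\n", q_rsqrt(4.0f));
    return 0;
}
```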

https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html
