CMU Introduction to Deep Learning
This is very interesting: the video walks through the diagrams in the Attention Is All You Need paper.
Why self attention? (from the paper)
One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. The third is the path length between long-range dependencies in the network.
https://deeplearning.cs.cmu.edu/F25
Also very interesting: foundations-transformers-architecture/chapter-2-attention-mechanism-core-concepts/scaled-dot-product-attention
https://people.cs.pitt.edu/~kovashka/cs1678_sp21/dl_06_transformers.pdf
d2l.ai/chapter_attention-mechanisms-and-transformers
https://web.stanford.edu/~jurafsky/slp3/ (Speech and Language Processing)
"Logarithmic Floating-Point Formats (LogFMT-nBit)"
Scaled dot-product attention (and especially the gradient w.r.t. V) connects very directly to hardware design and performance, particularly on GPUs, TPUs, and AI accelerators. I'll relate the math to what actually happens in hardware.
1. Recall the core computation
Forward pass:
A = softmax(QKᵀ / √d_k),    O = A V
Backward pass (key part):
∂L/∂V = Aᵀ G,    where G = ∂L/∂O is the upstream gradient
This already looks like linear algebra kernels — which is exactly how hardware sees it.
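To make the two expressions concrete, here is a minimal NumPy sketch; the sizes and the stand-in upstream gradient G are illustrative, not taken from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                       # illustrative sizes
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

# forward pass
A = softmax(Q @ K.T / np.sqrt(d_k))         # attention weights
O = A @ V                                   # attention output

# backward pass w.r.t. V, given an upstream gradient G = dL/dO
G = rng.normal(size=O.shape)                # stand-in for dL/dO
dV = A.T @ G                                # one GEMM, no softmax involved
```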
2. Hardware view: everything is GEMM
On hardware, attention is dominated by matrix multiplications (GEMMs):
| Math object | Hardware interpretation |
|---|---|
| QKᵀ / √d_k | Large matrix multiply (GEMM) plus an elementwise scale |
| softmax | Elementwise ops + reductions |
| A V | Another matrix multiply (GEMM) |
| ∂L/∂V = Aᵀ G | Matrix multiply (GEMM) |
👉 Gradients w.r.t. V are “easy” for hardware:
They are just one GEMM, no softmax, no exponentials.
That’s why:
- Backprop through V is cheaper than through Q or K
- Hardware kernels are often optimized assuming this structure
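A quick way to convince yourself of this is to compare the single-GEMM expression against autograd. A hedged PyTorch sketch (the sizes are arbitrary):

```python
import torch

torch.manual_seed(0)
n, d_k, d_v = 5, 16, 16
Q, K = torch.randn(n, d_k), torch.randn(n, d_k)
V = torch.randn(n, d_v, requires_grad=True)

A = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)
O = A @ V
G = torch.randn_like(O)          # pretend upstream gradient dL/dO
O.backward(G)                    # autograd computes dL/dV

manual_dV = A.T @ G              # the single-GEMM expression
print(torch.allclose(V.grad, manual_dV, atol=1e-6))   # expected: True
```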
3. Why scaling matters for hardware, not just math
The scaling term 1/√d_k has a hardware justification, not just a numerical-analysis one.
Without scaling:
- Dot products grow with d_k (their variance scales with the dimension)
- Softmax becomes very peaky
- Gradients become nearly zero for most entries
On hardware this causes:
- Wasted compute (multiplying by near-zero values)
- Poor utilization of vector units
- Underflow in FP16 / BF16
Scaling:
- Keeps values in a range where softmax + gradient stay active
- Makes low-precision arithmetic viable
- Directly enables modern accelerators to run attention efficiently
👉 This is why scaling was necessary for GPUs/TPUs, not optional.
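A small NumPy experiment makes the saturation visible; the dimensions here are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k, n_keys = 512, 64
q = rng.normal(size=d_k)
K = rng.normal(size=(n_keys, d_k))

scores = K @ q                                 # unscaled: std ~ sqrt(d_k)
print(softmax(scores).max())                   # typically close to 1: one key dominates
print(softmax(scores / np.sqrt(d_k)).max())    # much smaller: mass is spread across keys
```

A peaky softmax means most entries of the attention matrix carry gradients that are effectively zero, which is exactly the wasted work described above.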
4. Memory bandwidth dominates attention
For attention, memory movement is more expensive than math.
For the gradient w.r.t. V, ∂L/∂V = Aᵀ G:
Hardware consequences:
- Must read A (the attention matrix)
- Must read the upstream gradient G
- Must write the gradient w.r.t. V
This is why:
- Attention is often memory-bound, not compute-bound
- Hardware designs try to avoid storing A explicitly
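For a rough sense of why this step is memory-bound, here is a back-of-the-envelope estimate in Python; the sequence length, head dimension, and the roughly 150-300 FLOP/byte machine balance quoted in the comment are illustrative assumptions, not measurements:

```python
# rough arithmetic intensity of dV = A^T G for one attention head
n, d_v = 4096, 64                 # assumed sequence length and head dimension
bytes_per_elem = 2                # FP16 / BF16 storage

flops = 2 * n * n * d_v                          # multiply-accumulates in the GEMM
bytes_moved = bytes_per_elem * (n * n            # read A
                                + n * d_v        # read G
                                + n * d_v)       # write dV
print(f"{flops / bytes_moved:.1f} FLOP/byte")    # ~62 here; well below the rough
                                                 # 150-300 FLOP/byte balance of recent
                                                 # GPUs, so HBM traffic is the limit
```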
5. FlashAttention: gradient w.r.t. V in hardware terms
Modern kernels (FlashAttention, xFormers) exploit this:
Instead of:
- Compute A
- Store A to HBM
- Multiply AᵀG later
They do:
- Recompute attention blocks on the fly
- Accumulate AᵀG while streaming through on-chip SRAM
This works especially well because:
- Gradient w.r.t. V is linear
- No need to backprop through softmax for V
👉 Hardware takeaway: the structure of ∂L/∂V = Aᵀ G is what enables FlashAttention-style kernels.
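A simplified NumPy sketch of that streaming idea (an illustration of the blockwise recompute-and-accumulate pattern, not the actual FlashAttention kernel):

```python
import numpy as np

def dV_streaming(Q, K, V, G, block=128):
    """Accumulate dL/dV = A^T G block by block without materializing A.

    Each query block's attention rows are recomputed on the fly and their
    contribution A_block^T @ G_block is added to the running dV, mimicking
    how fused kernels keep tiles in on-chip SRAM instead of HBM."""
    n, d_k = Q.shape
    scale = 1.0 / np.sqrt(d_k)
    dV = np.zeros_like(V)
    for start in range(0, n, block):
        q_blk = Q[start:start + block]
        s = q_blk @ K.T * scale                     # recompute scores for this tile
        s -= s.max(axis=-1, keepdims=True)          # stable softmax
        a = np.exp(s)
        a /= a.sum(axis=-1, keepdims=True)          # attention rows for this tile
        dV += a.T @ G[start:start + block]          # stream the A^T G accumulation
    return dV
```

For small sizes the result matches the dense Aᵀ G computation, which is a convenient sanity check for the tiling logic.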
6. Why V is hardware-friendly (compared to Q and K)
| Gradient | Hardware difficulty |
|---|---|
| w.r.t. V | ✅ Single GEMM |
| w.r.t. Q | ❌ GEMM + softmax Jacobian |
| w.r.t. K | ❌ GEMM + softmax Jacobian |
This leads to:
- Simpler dataflow
- Fewer synchronization points
- Better pipelining
That’s why accelerator papers often emphasize attention output and V gradients.
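For contrast, here is a hedged NumPy sketch of what the Q and K gradients require, where A is the saved softmax output and G = ∂L/∂O; the extra row-wise Jacobian step is exactly what the V gradient avoids:

```python
import numpy as np

def attention_backward_qk(Q, K, V, A, G):
    """Gradients w.r.t. Q and K: GEMM + softmax Jacobian + GEMM."""
    scale = 1.0 / np.sqrt(Q.shape[-1])
    dA = G @ V.T                                            # GEMM
    # row-wise softmax Jacobian: dS = A * (dA - sum_j(dA * A))
    dS = A * (dA - (dA * A).sum(axis=-1, keepdims=True))
    dQ = (dS @ K) * scale                                   # GEMM
    dK = (dS.T @ Q) * scale                                 # GEMM
    return dQ, dK
```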
7. Accelerator design implications
Modern AI accelerators are designed around attention:
- Systolic arrays → optimized for GEMMs like QKᵀ, A V, and Aᵀ G
- On-chip SRAM → holds tiles of Q/K/V
- Low-precision units (FP16/BF16) → enabled by scaling
- Fused kernels → softmax + matmul in one pass (illustrated below)
Gradient w.r.t. V fits perfectly into this model.
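As one illustration of the "fused kernels" point above, here is a minimal online-softmax sketch in NumPy: the softmax denominator and the weighted sum over V are built in a single pass over the scores, which is what lets a kernel fuse softmax with the following matmul instead of writing the score matrix out to memory. The function name and per-row formulation are illustrative.

```python
import numpy as np

def online_attention_row(scores, V):
    """Compute softmax(scores) @ V for one query row in a single pass.

    A running max m, running denominator l, and running output o are
    rescaled as new scores arrive, so the scores never need a second pass."""
    m, l = -np.inf, 0.0
    o = np.zeros(V.shape[1])
    for s_j, v_j in zip(scores, V):
        m_new = max(m, s_j)
        c = np.exp(m - m_new)          # rescale old accumulators (0.0 on the first step)
        w = np.exp(s_j - m_new)
        l = l * c + w
        o = o * c + w * v_j
        m = m_new
    return o / l
```

For any row of scores, the result matches softmax(scores) @ V computed the usual two-pass way.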
8. Big picture
Scaled dot-product attention is shaped by hardware realities.
- Scaling → stable low-precision arithmetic
- Gradient w.r.t. V → pure matrix multiply
- Memory layout → streaming-friendly
- Kernel fusion → avoids bandwidth bottlenecks
This is why attention looks the way it does today.
Preface: From Attention Theory to Hardware Reality
My interest in the hardware implications of scaled dot-product attention started in an unexpected place: while watching the CMU Deep Learning Bootcamp homework video. The video briefly referenced attention as a standard building block, but glossed over the details of why it is formulated the way it is and how it behaves during training. That omission pushed me to look more closely at the mathematical structure of attention, especially the role of scaling and the gradients flowing through the value matrix V.
From there, I traced the idea across several canonical resources. The original formulation appears in Attention Is All You Need, where Vaswani et al. introduce scaled dot-product attention and motivate the 1/√d_k factor as a way to stabilize training. While the paper focuses on architectural innovation rather than implementation details, it establishes the exact algebraic form that all modern systems now rely on.
To understand the mechanics behind this formulation, I turned to Goodfellow, Bengio, and Courville’s Deep Learning, which—although it predates Transformers—provides the necessary foundations: softmax behavior, gradient propagation, and numerical stability. These concepts explain why unscaled dot products can cause gradient issues and why scaling becomes essential in high-dimensional settings.
I then consulted Dive into Deep Learning (d2l.ai), which offers a more modern and explicit treatment of attention mechanisms. D2L presents scaled dot-product attention in executable form, making the separation between queries, keys, and values concrete. Although it does not derive gradients explicitly, it makes clear that the attention output is a matrix product involving V, which has important consequences for backpropagation.
Additional clarification came from calculus-based explanations of softmax and matrix derivatives, as well as secondary resources such as APX/lecture-style notes, which bridge the gap between abstract math and implementation intuition. Across these sources, a pattern emerged: while the forward computation of attention is well documented, the backward pass, especially the gradient with respect to V, is rarely emphasized, despite being structurally simple.
This observation naturally led to a hardware-oriented question:
if the gradient with respect to V is essentially a matrix multiplication, what does that imply for how attention is implemented and optimized on real hardware?
In the remainder of this section, I shift perspective from theory to systems. Building on the standard formulation from Attention Is All You Need and the mathematical grounding provided by Goodfellow and D2L, we now examine how attention maps onto modern accelerators. In particular, we will look at three key ideas that explain why scaled dot-product attention is not just mathematically convenient, but also exceptionally well suited to GPUs and specialized AI hardware.
Scaled dot-product attention maps almost perfectly onto the primitives of modern accelerators: large matrix multiplies, reductions, and fused elementwise operations.
This alignment is not accidental — it is one of the reasons Transformers scale so well on GPUs and TPUs.
1. Attention becomes GEMMs on accelerators
Modern accelerators (GPUs, TPUs) are built around:
- Matrix multiply engines (CUDA cores, Tensor Cores, systolic arrays)
- High-throughput vectorized arithmetic
Scaled dot-product attention decomposes cleanly into S = QKᵀ / √d_k, A = softmax(S), and O = A V.
From a hardware perspective:
- QKᵀ → GEMM
- A V → GEMM
- Aᵀ G (the gradient w.r.t. V) → GEMM
The hardware discussion above explains that attention is mostly just matrix multiplication, which is exactly what accelerators are best at. This directly answers how attention maps onto the hardware.
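To see this mapping in code, the sketch below compares the manual GEMM → softmax → GEMM decomposition against torch.nn.functional.scaled_dot_product_attention (available in PyTorch 2.x), which dispatches to fused kernels when it can; the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, n, d = 1, 2, 128, 64                 # batch, heads, tokens, head dim
Q, K, V = (torch.randn(B, H, n, d) for _ in range(3))

# manual decomposition: GEMM -> softmax -> GEMM
S = Q @ K.transpose(-2, -1) / d ** 0.5
O_manual = torch.softmax(S, dim=-1) @ V

# library call; on GPU this may select a FlashAttention-style fused kernel
O_fused = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(O_manual, O_fused, atol=1e-5))   # expected: True, up to kernel rounding
```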
2. Scaling enables low-precision hardware
Accelerators rely heavily on:
- FP16
- BF16
- TensorFloat-32
The 1/√d_k scaling:
- Keeps activations in a numerically stable range
- Prevents softmax saturation
- Preserves useful gradients
In hardware terms:
- Prevents underflow/overflow
- Keeps Tensor Cores fully utilized
- Makes large-scale training feasible
The hardware discussion above ties this scaling factor directly to the numerical constraints of accelerator arithmetic, not just abstract optimization theory.
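A small NumPy demonstration of the precision point, with illustrative sizes: without the 1/√d_k scaling, many softmax entries underflow to exactly zero in FP16, while the scaled version keeps them representable.

```python
import numpy as np

def softmax_fp16(x):
    x = x - x.max()                       # usual max subtraction, still in FP16
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d_k, n_keys = 1024, 32
q = rng.normal(size=d_k)
K = rng.normal(size=(n_keys, d_k))

scores = (K @ q).astype(np.float16)       # unscaled logit magnitude ~ sqrt(d_k) ~ 32

p_raw    = softmax_fp16(scores)
p_scaled = softmax_fp16(scores / np.float16(np.sqrt(d_k)))

print(int((p_raw == 0).sum()))            # many entries flush to exactly 0 in FP16
print(int((p_scaled == 0).sum()))         # typically 0: every probability survives
```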
3. Memory and dataflow dominate performance
Modern accelerators are:
- Compute-rich
- Memory-bandwidth limited
The sections above explain:
- Why storing the attention matrix is expensive
- Why recomputation is often cheaper than memory access
- Why gradients w.r.t. V are hardware-friendly
This leads naturally to:
- FlashAttention
- Kernel fusion
- On-chip SRAM tiling
All of this is squarely about how attention executes on real accelerators.
The structure of scaled dot-product attention is not just mathematically convenient; it maps directly onto the execution model of modern accelerators. GPUs and TPUs are optimized for large matrix multiplications, low-precision arithmetic, and streaming dataflow, all of which appear naturally in attention. Even the gradient with respect to the value matrix reduces to a single matrix multiply. In this sense, attention is not merely hardware-accelerated — it is hardware-shaped.