Wednesday, December 24, 2025

CMU Intro to Deep Learning


This is very interesting. The video talks about the diagrams in the Attention Is All You Need paper.

Why self attention? (from the paper)

One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. The third is the path length between long-range dependencies in the network.

https://deeplearning.cs.cmu.edu/F25

This is very interesting. foundations-transformers-architecture/chapter-2-attention-mechanism-core-concepts/scaled-dot-product-attention

deeplearningbook

https://people.cs.pitt.edu/~kovashka/cs1678_sp21/dl_06_transformers.pdf

https://cocalc.com/github/leechanwoo-kor/coursera/blob/main/deep-learning-specialization/course-5-sequence-models/C5_W4_A1_Transformer_Subclass_v1.ipynb?utm_source=chatgpt.com

d2l.ai/chapter_attention-mechanisms-and-transformers

https://d2l.ai/chapter_attention-mechanisms-and-transformers/attention-scoring-functions.html#scaled-dot-product-attention

https://web.stanford.edu/~jurafsky/slp3/ (Speech and Language Processing)

"Logarithmic Floating-Point Formats (LogFMT-nBit)"


Scaled dot-product attention (and especially the gradient w.r.t. V) connects very directly to hardware design and performance, particularly on GPUs, TPUs, and AI accelerators. I'll relate the math to what actually happens in hardware.


1. Recall the core computation

Forward pass:

\text{Attention}(Q,K,V) = \underbrace{\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)}_{A}\, V

Backward pass (key part):

\frac{\partial L}{\partial V} = A^T \, \frac{\partial L}{\partial \text{Output}}

This already looks like linear algebra kernels — which is exactly how hardware sees it.
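
To make this concrete, here is a minimal NumPy sketch of the forward pass and of the V gradient. The sizes and random inputs are illustrative assumptions, not taken from any particular model.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Forward pass: A = softmax(Q K^T / sqrt(d_k)), output = A V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) logits
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)               # row-wise softmax
    return A @ V, A                                  # output (n, d_v), weights (n, n)

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 8, 8                                # illustrative sizes
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

out, A = scaled_dot_product_attention(Q, K, V)

# Backward w.r.t. V: given an upstream gradient G = dL/dOutput,
# dL/dV = A^T G is a single matrix multiply -- no softmax derivative involved.
G = rng.standard_normal(out.shape)
dV = A.T @ G
print(dV.shape)                                      # (6, 8), same shape as V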


2. Hardware view: everything is GEMM

On hardware, attention is dominated by matrix multiplications (GEMMs):

Math object → hardware interpretation:

  • QK^T → large matrix multiply

  • softmax → elementwise ops + reductions

  • AV → another matrix multiply

  • \frac{\partial L}{\partial V} = A^T G → matrix multiply

👉 Gradients w.r.t. V are “easy” for hardware:
They are just one GEMM, no softmax, no exponentials.

That’s why:

  • Backprop through V is cheaper than through Q or K

  • Hardware kernels are often optimized assuming this structure
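
A back-of-the-envelope FLOP count makes the GEMM dominance explicit. The sequence length, head dimension, and the rough per-entry softmax cost below are illustrative assumptions, not measurements.

n, d = 4096, 64                      # assumed sequence length and head dimension

gemm_qk = 2 * n * n * d              # Q K^T
gemm_av = 2 * n * n * d              # A V
gemm_dv = 2 * n * n * d              # A^T G  (gradient w.r.t. V)
softmax_ops = 5 * n * n              # very rough: max, subtract, exp, sum, divide per entry

total = gemm_qk + gemm_av + gemm_dv + softmax_ops
print(f"GEMM share of arithmetic: {(total - softmax_ops) / total:.1%}")
# At these sizes the three GEMMs account for roughly 98-99% of the arithmetic.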


3. Why scaling matters for hardware, not just math

The scaling in

\frac{QK^T}{\sqrt{d_k}}

matters for hardware reasons, not just for numerical analysis.

Without scaling:

  • Dot products grow with d_k (their variance scales linearly with d_k)

  • Softmax becomes very peaky

  • Gradients become nearly zero for most entries

On hardware this causes:

  • Wasted compute (multiplying by near-zero)

  • Poor utilization of vector units

  • Underflow in FP16 / BF16

Scaling:

  • Keeps values in a range where softmax + gradient stay active

  • Makes low-precision arithmetic viable

  • Directly enables modern accelerators to run attention efficiently

👉 This is why scaling was necessary for GPUs/TPUs, not optional.
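
A small experiment sketch makes the saturation point visible. The unit-variance inputs, d_k = 512, and 64 keys are assumptions chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)
d_k, n_keys = 512, 64
q = rng.standard_normal(d_k)
K = rng.standard_normal((n_keys, d_k))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

raw    = K @ q                  # unscaled logits: std grows like sqrt(d_k), ~23 here
scaled = raw / np.sqrt(d_k)     # scaled logits: std stays around 1

for name, z in [("unscaled", raw), ("scaled", scaled)]:
    a = softmax(z)
    jac_diag = a * (1 - a)      # diagonal of the softmax Jacobian
    print(f"{name:9s} max weight = {a.max():.3f}  mean Jacobian entry = {jac_diag.mean():.2e}")

# Typically the unscaled softmax puts nearly all its mass on a single key,
# so a*(1-a) collapses toward zero and most of the gradient signal is wasted.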


4. Memory bandwidth dominates attention

For attention, memory movement is more expensive than math.

For gradients w.r.t. V:

\frac{\partial L}{\partial V} = A^T G

Hardware consequences:

  • Must read A (attention matrix)

  • Must read gradient G

  • Must write gradient V

This is why:

  • Attention is often memory-bound, not compute-bound

  • Hardware designs try to avoid storing A explicitly
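
A quick arithmetic-intensity estimate backs this up. The fp16 storage, the sizes, and the comparison against a typical accelerator "ridge point" are illustrative assumptions; real fused kernels move far fewer bytes.

n, d = 4096, 64                 # assumed sequence length and head dimension
bytes_per_el = 2                # fp16 / bf16 storage

flops = 2 * n * n * d                      # the single GEMM A^T G
bytes_moved = bytes_per_el * (n * n        # read A
                              + n * d      # read G
                              + n * d)     # write dL/dV
print(f"arithmetic intensity ~ {flops / bytes_moved:.0f} FLOPs/byte")   # roughly 60 here

# Many current accelerators need on the order of 100+ FLOPs per byte of HBM traffic
# to stay compute-bound at half precision, so reading the full A from HBM tends to
# leave this step memory-bound -- which is exactly the motivation for not storing A.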


5. FlashAttention: gradient w.r.t. V in hardware terms

Modern kernels (FlashAttention, xFormers) exploit this:

Instead of:

  1. Compute A

  2. Store A to HBM

  3. Multiply A^T G later

They do:

  • Recompute attention blocks on the fly

  • Accumulate \partial L / \partial V while streaming blocks through on-chip SRAM

This works especially well because:

  • Gradient w.r.t. V is linear

  • No need to backprop through softmax for V

👉 Hardware takeaway:

The structure of \partial L / \partial V is what enables FlashAttention-style kernels.
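
Here is a simplified sketch of that idea: it tiles only over query blocks and recomputes each block's attention rows instead of storing A. Real FlashAttention also tiles over keys and uses an online softmax, so treat this as an illustration of the accumulation pattern, not the actual kernel.

import numpy as np

def dV_streaming(Q, K, G, block=128):
    """Accumulate dL/dV = A^T G over query blocks, recomputing each block's
    attention weights instead of materializing the full n x n matrix A."""
    n, d_k = Q.shape
    dV = np.zeros_like(G)                            # dV has the same shape as V
    for start in range(0, n, block):
        q_blk = Q[start:start + block]               # (b, d_k)
        s = q_blk @ K.T / np.sqrt(d_k)               # (b, n) block of logits
        s -= s.max(axis=-1, keepdims=True)
        a = np.exp(s)
        a /= a.sum(axis=-1, keepdims=True)           # (b, n) block of A
        dV += a.T @ G[start:start + block]           # accumulate A^T G
    return dV

# Quick check against the dense computation (illustrative sizes).
rng = np.random.default_rng(0)
n, d = 512, 32
Q, K, G = rng.standard_normal((3, n, d))
S = Q @ K.T / np.sqrt(d)
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
assert np.allclose(dV_streaming(Q, K, G), A.T @ G)

Note that V itself never appears in the computation of dV; only the (recomputed) weights and the upstream gradient are needed, which is what makes the streaming formulation so simple.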


6. Why V is hardware-friendly (compared to Q and K)

Gradient → hardware difficulty:

  • w.r.t. V → ✅ single GEMM

  • w.r.t. Q → ❌ GEMM + softmax Jacobian

  • w.r.t. K → ❌ GEMM + softmax Jacobian

This leads to:

  • Simpler dataflow

  • Fewer synchronization points

  • Better pipelining

That’s why accelerator papers often emphasize attention output and V gradients.
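
For comparison, here is a minimal NumPy sketch of all three gradients, using standard softmax and matrix calculus with illustrative shapes; the extra Jacobian step shows up only for Q and K.

import numpy as np

def attention_backward(Q, K, V, G):
    """Gradients of the loss w.r.t. Q, K, V, given upstream G = dL/dOutput."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)

    dV = A.T @ G                                           # one GEMM, no softmax derivative
    dA = G @ V.T                                           # gradient flowing into the weights
    dS = A * (dA - (A * dA).sum(axis=-1, keepdims=True))   # row-wise softmax Jacobian
    dQ = dS @ K / np.sqrt(d_k)                             # extra GEMM only after the Jacobian
    dK = dS.T @ Q / np.sqrt(d_k)
    return dQ, dK, dV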


7. Accelerator design implications

Modern AI accelerators are designed around attention:

  • Systolic arrays → optimized for GEMMs like A^T G

  • On-chip SRAM → holds tiles of Q/K/V

  • Low-precision units (FP16/BF16) → enabled by scaling

  • Fused kernels → softmax + matmul in one pass

Gradient w.r.t. V fits perfectly into this model.


8. Big picture

Scaled dot-product attention is shaped by hardware realities.

  • Scaling → stable low-precision arithmetic

  • Gradient w.r.t. V → pure matrix multiply

  • Memory layout → streaming-friendly

  • Kernel fusion → avoids bandwidth bottlenecks

This is why attention looks the way it does today.

Preface: From Attention Theory to Hardware Reality

My interest in the hardware implications of scaled dot-product attention started in an unexpected place: while watching the CMU Deep Learning Bootcamp homework video. The video briefly referenced attention as a standard building block, but glossed over the details of why it is formulated the way it is and how it behaves during training. That omission pushed me to look more closely at the mathematical structure of attention, especially the role of scaling and the gradients flowing through the value matrix V.

From there, I traced the idea across several canonical resources. The original formulation appears in Attention Is All You Need, where Vaswani et al. introduce scaled dot-product attention and motivate the \sqrt{d_k} factor as a way to stabilize training. While the paper focuses on architectural innovation rather than implementation details, it establishes the exact algebraic form that all modern systems now rely on.

To understand the mechanics behind this formulation, I turned to Goodfellow, Bengio, and Courville’s Deep Learning, which—although it predates Transformers—provides the necessary foundations: softmax behavior, gradient propagation, and numerical stability. These concepts explain why unscaled dot products can cause gradient issues and why scaling becomes essential in high-dimensional settings.

I then consulted Dive into Deep Learning (d2l.ai), which offers a more modern and explicit treatment of attention mechanisms. D2L presents scaled dot-product attention in executable form, making the separation between queries, keys, and values concrete. Although it does not derive gradients explicitly, it makes clear that the attention output is a matrix product involving V, which has important consequences for backpropagation.

Additional clarification came from calculus-based explanations of softmax and matrix derivatives, as well as secondary resources such as APX/lecture-style notes, which bridge the gap between abstract math and implementation intuition. Across these sources, a pattern emerged: while the forward computation of attention is well documented, the backward pass—especially the gradient with respect to V—is rarely emphasized, despite being structurally simple.

This observation naturally led to a hardware-oriented question:
if the gradient with respect to V is essentially a matrix multiplication, what does that imply for how attention is implemented and optimized on real hardware?

In the remainder of this section, I shift perspective from theory to systems. Building on the standard formulation from Attention Is All You Need and the mathematical grounding provided by Goodfellow and D2L, we now examine how attention maps onto modern accelerators. In particular, we will look at three key ideas that explain why scaled dot-product attention is not just mathematically convenient, but also exceptionally well suited to GPUs and specialized AI hardware.

Scaled dot-product attention maps almost perfectly onto the primitives of modern accelerators: large matrix multiplies, reductions, and fused elementwise operations.
This alignment is not accidental — it is one of the reasons Transformers scale so well on GPUs and TPUs.

1. Attention becomes GEMMs on accelerators

Modern accelerators (GPUs, TPUs) are built around:

  • Matrix multiply engines (CUDA cores, Tensor Cores, systolic arrays)

  • High-throughput vectorized arithmetic

Scaled dot-product attention decomposes cleanly into:

QK^T \;\rightarrow\; \text{softmax} \;\rightarrow\; AV

From a hardware perspective:

  • QK^T → GEMM

  • AV → GEMM

  • \partial L / \partial V = A^T G → GEMM

The hardware discussion above explains that attention is mostly just matrix multiplication, which is exactly what accelerators are best at.

This is the direct answer to how attention maps onto hardware.
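
If PyTorch is available, the GEMM form of the V gradient can be sanity-checked directly against autograd; the sizes below are illustrative.

import torch

torch.manual_seed(0)
n, d = 16, 8
Q, K, V = (torch.randn(n, d, requires_grad=True) for _ in range(3))

A = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)    # attention weights
out = A @ V
G = torch.randn_like(out)                        # stand-in for dL/dOutput
out.backward(G)

# autograd agrees with the single-GEMM formula dL/dV = A^T G
assert torch.allclose(V.grad, A.detach().T @ G, atol=1e-6)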


2. Scaling enables low-precision hardware

Accelerators rely heavily on:

  • FP16

  • BF16

  • TensorFloat-32

The \frac{1}{\sqrt{d_k}} scaling:

  • Keeps activations in a numerically stable range

  • Prevents softmax saturation

  • Preserves useful gradients

In hardware terms:

  • Prevents underflow/overflow

  • Keeps Tensor Cores fully utilized

  • Makes large-scale training feasible

The hardware discussion above ties this scaling factor directly to the numerical constraints of accelerator arithmetic, not just to abstract optimization theory.
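
A tiny sketch shows why this bites at half precision in particular; the unit-variance inputs and d_k = 1024 are assumptions chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)
d_k, n_keys = 1024, 32
q = rng.standard_normal(d_k).astype(np.float16)
K = rng.standard_normal((n_keys, d_k)).astype(np.float16)

# Accumulate the dot products in fp32 (as mixed-precision matmuls typically do),
# then store the logits in fp16.
logits = (K.astype(np.float32) @ q.astype(np.float32)).astype(np.float16)

with np.errstate(over="ignore"):                   # silence the expected overflow warning
    print(np.exp(logits).max())                    # unscaled: logits of ~30+ overflow fp16 -> inf
scaled = (logits.astype(np.float32) / np.sqrt(d_k)).astype(np.float16)
print(np.exp(scaled).max())                        # scaled: stays comfortably finite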


3. Memory and dataflow dominate performance

Modern accelerators are:

  • Compute-rich

  • Memory-bandwidth limited

The earlier hardware sections explain:

  • Why storing the attention matrix A is expensive

  • Why recomputation is often cheaper than memory access

  • Why gradients w.r.t. V are hardware-friendly

This leads naturally to:

  • FlashAttention

  • Kernel fusion

  • On-chip SRAM tiling

All of this is squarely about how attention executes on real accelerators.


The structure of scaled dot-product attention is not just mathematically convenient; it maps directly onto the execution model of modern accelerators. GPUs and TPUs are optimized for large matrix multiplications, low-precision arithmetic, and streaming dataflow, all of which appear naturally in attention. Even the gradient with respect to the value matrix reduces to a single matrix multiply. In this sense, attention is not merely hardware-accelerated — it is hardware-shaped.

