Friday, December 26, 2025

How to Learn About the Hardware Behind AI

A hardware-first path to understanding GPUs, TPUs, and LLMs

How matrix multiplication, kernels, and accelerator design explain modern machine learning systems



d2l.ai is about what work you ask the hardware to do; hardware texts are about how the machine executes that work.

They meet in the middle — but they answer different questions.

Let me explain the connection clearly, and show when d2l.ai becomes useful if you care about hardware.


The clean mental separation

d2l.ai answers:

  • What are MLPs, CNNs, and Transformers?

  • What computations do they perform?

  • What tensors exist, and how do they flow?

  • What operations dominate training and inference?

Hardware resources answer:

  • How are those computations executed?

  • How is data moved and reused?

  • What limits throughput?

  • What architectural choices matter?

Think of d2l.ai as defining the workload, and hardware material as defining the machine.


The actual connection point: linear algebra

Almost everything in d2l.ai reduces to:

  • matrix multiplication

  • elementwise operations

  • reductions (sum, softmax, norm)

That’s the bridge.
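
To make the bridge concrete, here is a minimal PyTorch sketch of those three primitive families (shapes are arbitrary, chosen only for illustration):

    import torch

    x = torch.randn(64, 512)          # a batch of activations
    w = torch.randn(512, 512)         # a weight matrix

    y = x @ w                         # 1. matrix multiplication (a GEMM)
    y = torch.relu(y)                 # 2. elementwise operation
    p = torch.softmax(y, dim=-1)      # 3. reduction (max and sum live inside softmax)

Every model in the book is some arrangement of these three lines at scale.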

An MLP in d2l.ai:

  • looks like layers and equations

  • is GEMMs + activation kernels on hardware

A Transformer block:

  • looks like attention and feedforward layers

  • is batched matmuls + memory movement + reductions

d2l.ai gives you semantic structure; hardware explains execution reality.


Why Chapter 7 (Patterson) feels “more GPU/TPU-like”

Chapter 7 teaches:

  • throughput machines

  • parallelism

  • data movement

  • accelerator design principles

That’s why it maps cleanly to GPUs and TPUs.

But it doesn’t tell you:

  • what workloads actually look like

  • why attention is painful for memory

  • why autoregressive inference is sequential

d2l.ai fills that gap.
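
Take "inference is sequential" as an example. It comes straight from the workload: each generated token feeds the next forward pass. A toy sketch, with random weights standing in for a real Transformer and made-up sizes:

    import torch

    vocab, d = 100, 16
    emb = torch.randn(vocab, d)       # toy embedding table
    proj = torch.randn(d, vocab)      # toy output projection

    tokens = [0]                      # start token
    for _ in range(8):                # one full forward pass per new token
        h = emb[torch.tensor(tokens)].mean(dim=0)   # crude context summary
        logits = h @ proj
        tokens.append(int(logits.argmax()))         # loop-carried dependence

The loop cannot be parallelized across steps, because step t needs the token produced at step t-1. No amount of hardware knowledge alone tells you that; it is a property of the model.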


When d2l.ai becomes useful for a hardware-minded person

❌ Too early

If you read d2l.ai before understanding hardware:

  • models feel abstract

  • performance implications are invisible

  • everything looks “just math”

✅ Useful after hardware basics

Once you understand:

  • memory hierarchies

  • matmul dominance

  • bandwidth vs compute limits

d2l.ai suddenly becomes:

  • a workload specification

  • a way to reason about why hardware behaves the way it does

You start asking:

  • “How many matmuls does this layer induce?”

  • “Is this memory- or compute-bound?”

  • “Why does this scale poorly?”

That’s the real payoff.
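
Those questions have back-of-the-envelope answers. A minimal arithmetic-intensity check for a single matmul; the 100 TFLOP/s and 1 TB/s figures below are hypothetical, used only to mark a roofline ridge point:

    # FLOPs and bytes for Y = X @ W, X: (m, k), W: (k, n), fp16 = 2 bytes/element
    m, k, n = 1, 4096, 4096           # batch-1 decoding: a very thin matmul
    flops = 2 * m * k * n             # one multiply + one add per output term
    bytes_moved = 2 * (m * k + k * n + m * n)

    intensity = flops / bytes_moved   # FLOPs per byte of memory traffic
    ridge = 100e12 / 1e12             # hypothetical: 100 TFLOP/s over 1 TB/s
    bound = "memory" if intensity < ridge else "compute"
    print(f"{intensity:.1f} FLOPs/byte -> {bound}-bound")

At batch size 1 the weight matrix streams through memory with almost no reuse, so the intensity is about 1 FLOP per byte: firmly memory-bound. Growing m raises the reuse until the same matmul becomes compute-bound.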


Concrete example: MLP

In d2l.ai:

An MLP layer computes Y = XW + b

In hardware terms:

  • XW → GEMM kernel

  • + b → elementwise kernel

  • activation → elementwise kernel

  • memory traffic dominates unless kernels are fused

Without hardware knowledge, you stop at the equation.
With hardware knowledge, you see:

  • kernel launches

  • memory traffic

  • fusion opportunities
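
In PyTorch eager mode that breakdown is visible line by line; each operation below typically launches its own kernel and re-reads its input from memory (a sketch, not a profiler trace):

    import torch

    x = torch.randn(256, 1024)
    w = torch.randn(1024, 1024)
    b = torch.randn(1024)

    y = x @ w              # GEMM kernel
    y = y + b              # elementwise kernel: y travels through memory again
    y = torch.relu(y)      # elementwise kernel: and again

    # torch.addmm folds the bias add into the GEMM; a compiler such as
    # torch.compile can fuse the activation too, cutting memory traffic.
    y_fused = torch.relu(torch.addmm(b, x, w))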


Concrete example: Attention

In d2l.ai:

Attention = softmax(QKᵀ / √d) V

In hardware terms:

  • QKᵀ → large matmul

  • softmax → reduction + exp

  • memory-bound unless tiled

  • autoregressive decoding is sequential, one token at a time
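
A direct, naive translation of the formula shows where the memory goes (real kernels such as FlashAttention tile exactly this computation to avoid materializing the scores):

    import torch

    B, S, d = 4, 1024, 64                        # batch, sequence length, head dim
    Q, K, V = (torch.randn(B, S, d) for _ in range(3))

    scores = Q @ K.transpose(-2, -1) / d**0.5    # batched matmul: B x S x S
    attn = torch.softmax(scores, dim=-1)         # reduction + exp over each row
    out = attn @ V                               # second batched matmul

    # scores holds B * S * S values: quadratic in sequence length,
    # which is why untiled attention is memory-bound.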

This is where hardware intuition transforms d2l.ai from a textbook into a systems guide.


So how should you position d2l.ai?

As an extra reference alongside the hardware material.

A clean description you can use:

Dive into Deep Learning (d2l.ai) is useful as a workload-level companion once hardware fundamentals are clear. It describes the structure of models and training loops, which can then be interpreted through a hardware and systems lens.


One-sentence takeaway

d2l.ai tells you what the model does; hardware tells you how expensive that is and why.

Once you have both, ML systems stop being mysterious. 
