A hardware-first path to understanding GPUs, TPUs, and LLMs
How matrix multiplication, kernels, and accelerator design explain modern machine learning systems
d2l.ai is about what work you ask the hardware to do; hardware texts are about how the machine executes that work.
They meet in the middle — but they answer different questions.
Let me explain the connection clearly, and say when d2l.ai becomes useful if you care about hardware.
The clean mental separation
d2l.ai answers:
- What is an MLP, CNN, Transformer?
- What computations do they perform?
- What tensors exist, and how do they flow?
- What operations dominate training and inference?
Hardware resources answer:
- How are those computations executed?
- How is data moved and reused?
- What limits throughput?
- What architectural choices matter?
Think of d2l.ai as defining the workload, and hardware material as defining the machine.
The actual connection point: linear algebra
Almost everything in d2l.ai reduces to:
- matrix multiplication
- elementwise operations
- reductions (sum, softmax, norm)
That’s the bridge.
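Here is a minimal NumPy sketch of those three primitive categories; the shapes are arbitrary, chosen only for illustration:

```python
import numpy as np

# Illustrative shapes: a batch of 8 vectors, hidden size 16.
X = np.random.randn(8, 16)
W = np.random.randn(16, 16)

# 1. Matrix multiplication: the GEMM that dominates compute.
Y = X @ W

# 2. Elementwise operation: cheap math, but every element is read and written.
Y_relu = np.maximum(Y, 0.0)

# 3. Reduction: softmax combines an elementwise exp with a row-wise max and sum.
e = np.exp(Y - Y.max(axis=-1, keepdims=True))
softmax = e / e.sum(axis=-1, keepdims=True)
```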
An MLP in d2l.ai:
- looks like layers and equations
- is GEMMs + activation kernels on hardware
A Transformer block:
- looks like attention and feedforward layers
- is batched matmuls + memory movement + reductions
d2l.ai gives you semantic structure; hardware explains execution reality.
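For example, here is a rough inventory of the GEMMs inside one standard Transformer block. The dimension names (d, f, s) and the helper function are my own shorthand for illustration, not anything defined in d2l.ai:

```python
# Rough GEMM inventory for one Transformer block, ignoring norms and softmax.
# d = model width, f = FFN width (often 4*d), s = sequence length.
def transformer_block_gemms(d, f, s):
    macs = {
        "Q/K/V projections":  3 * (s * d * d),  # three (s,d) x (d,d) matmuls
        "attention QK^T":     s * s * d,        # (s,d) x (d,s)
        "attention (probs)V": s * s * d,        # (s,s) x (s,d)
        "output projection":  s * d * d,        # (s,d) x (d,d)
        "FFN up":             s * d * f,        # (s,d) x (d,f)
        "FFN down":            s * f * d,       # (s,f) x (f,d)
    }
    # These are multiply-accumulate counts; FLOPs are roughly 2x these numbers.
    return macs

print(transformer_block_gemms(d=4096, f=16384, s=2048))
```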
Why Chapter 7 (Patterson) feels “more GPU/TPU-like”
Chapter 7 teaches:
- throughput machines
- parallelism
- data movement
- accelerator design principles
That’s why it maps cleanly to GPUs and TPUs.
But it doesn’t tell you:
- what workloads actually look like
- why attention is painful for memory
- why autoregressive inference is sequential
d2l.ai fills that gap.
When d2l.ai becomes useful for a hardware-minded person
❌ Too early
If you read d2l.ai before understanding hardware:
- models feel abstract
- performance implications are invisible
- everything looks “just math”
✅ Useful after hardware basics
Once you understand:
- memory hierarchies
- matmul dominance
- bandwidth vs compute limits
d2l.ai suddenly becomes:
- a workload specification
- a way to reason about why hardware behaves the way it does
You start asking:
- “How many matmuls does this layer induce?”
- “Is this memory- or compute-bound?”
- “Why does this scale poorly?”
That’s the real payoff.
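A back-of-the-envelope sketch of the second question (“Is this memory- or compute-bound?”), using placeholder peak-compute and bandwidth numbers rather than any specific chip:

```python
# Arithmetic intensity of a GEMM Y = X @ W with X:(b,k), W:(k,n), fp16 storage.
# peak_flops and peak_bw are placeholder numbers, not a real accelerator spec.
def gemm_bound(b, k, n, bytes_per_elem=2,
               peak_flops=300e12, peak_bw=1.5e12):
    flops = 2 * b * k * n                              # multiply + add per output MAC
    bytes_moved = bytes_per_elem * (b*k + k*n + b*n)   # read X and W once, write Y once
    intensity = flops / bytes_moved                    # FLOPs per byte of traffic
    ridge = peak_flops / peak_bw                       # machine balance point
    return ("compute-bound" if intensity > ridge else "memory-bound", intensity)

# Large batch: each weight is reused across many rows -> compute-bound.
print(gemm_bound(b=4096, k=4096, n=4096))
# Batch-1 decode step: the whole weight matrix is read for one row -> memory-bound.
print(gemm_bound(b=1, k=4096, n=4096))
```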
Concrete example: MLP
In d2l.ai:
An MLP layer computes
Y = XW + b
In hardware terms:
- XW → GEMM kernel
- + b → elementwise kernel
- activation → elementwise kernel
- memory reads dominate unless fused
Without hardware knowledge, you stop at the equation.
With hardware knowledge, you see:
- kernel launches
- memory traffic
- fusion opportunities
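A rough byte-counting sketch of what fusion saves, assuming fp16 storage and arbitrary layer sizes:

```python
# Memory traffic for Y = relu(X @ W + b), with X:(m,k), W:(k,n), 2 bytes per element.
def mlp_traffic(m, k, n, bytes_per_elem=2):
    gemm = bytes_per_elem * (m*k + k*n + m*n)   # read X, W; write XW
    bias = bytes_per_elem * (m*n + n + m*n)     # read XW and b; write XW + b
    act  = bytes_per_elem * (m*n + m*n)         # read; apply relu; write
    unfused = gemm + bias + act                 # three kernels, output bounced through memory twice
    fused   = gemm + bytes_per_elem * n         # bias + relu applied in the GEMM epilogue
    return unfused, fused

unfused, fused = mlp_traffic(m=1024, k=4096, n=4096)
print(f"unfused: {unfused/1e6:.0f} MB, fused epilogue: {fused/1e6:.0f} MB")
```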
Concrete example: Attention
In d2l.ai:
Attention = softmax(QKᵀ / √d) V
In hardware terms:
- QKᵀ → large matmul
- softmax → reduction + exp
- memory-bound unless tiled
- inference is sequential: autoregressive decoding produces one token at a time
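A plain NumPy sketch of that decomposition, single head, toy sizes, no tiling, so the full s×s score matrix is materialized:

```python
import numpy as np

# Single-head attention on toy sizes: s tokens, head width d.
s, d = 128, 64
Q = np.random.randn(s, d)
K = np.random.randn(s, d)
V = np.random.randn(s, d)

scores = Q @ K.T / np.sqrt(d)                   # large matmul: (s,d) x (d,s) -> (s,s)
scores -= scores.max(axis=-1, keepdims=True)    # reduction (row max) for numerical stability
probs = np.exp(scores)                          # elementwise exp
probs /= probs.sum(axis=-1, keepdims=True)      # reduction (row sum) -> softmax
out = probs @ V                                 # second matmul: (s,s) x (s,d) -> (s,d)

# The (s,s) scores/probs arrays grow quadratically with sequence length;
# untiled, they are written out and re-read between these steps, which is
# exactly the traffic that tiled attention kernels avoid.
```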
This is where hardware intuition transforms d2l.ai from a textbook into a systems guide.
So how should you position d2l.ai?
As an extra reference.
A clean description you can use:
Dive into Deep Learning (d2l.ai) is useful as a workload-level companion once hardware fundamentals are clear. It describes the structure of models and training loops, which can then be interpreted through a hardware and systems lens.
One-sentence takeaway
d2l.ai tells you what the model does; hardware tells you how expensive that is and why.
Once you have both, ML systems stop being mysterious.