Below is a hardware-level translation of that write-up—i.e., what self-attention / Primal-Attention really means once you build it in silicon.
I’ll deliberately avoid ML jargon where possible and instead describe datapaths, memory traffic, and compute units.
1. What vanilla self-attention looks like in hardware
Mathematical view
Attention(Q, K, V) = softmax(QKᵀ / √d) · V,  with Q, K, V of shape N×d
Hardware reality
This explodes into three very expensive operations:
- QKᵀ matrix multiply
  - Shape: N×N (an (N×d) · (d×N) GEMM)
  - Cost: O(N²d) MACs
  - Requires materializing or streaming an N×N matrix
- Softmax
  - Exponentials + reductions
  - Poor accelerator utilization
  - Requires global normalization over each row of the N×N matrix
- Multiply by V
  - Another large matrix multiply: (N×N) · (N×d), again O(N²d) MACs
Hardware pain points
- ❌ Quadratic memory bandwidth (N² scores streamed to and from memory)
- ❌ Quadratic SRAM/DRAM pressure
- ❌ Poor data reuse
- ❌ Latency scales quadratically with sequence length
- ❌ Difficult to pipeline (the global softmax normalization must finish before the V multiply can start)
This is why attention dominates area, power, and latency on AI accelerators.
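To make the datapath concrete, here is a minimal NumPy sketch (illustrative only, not an accelerator kernel; names are illustrative) that exposes the N×N intermediate and the global row reductions:

```python
import numpy as np

def vanilla_attention(Q, K, V):
    """Naive self-attention, written to expose the hardware cost.

    Q, K, V: (N, d) arrays. The (N, N) score matrix is materialized
    explicitly, which is exactly the quadratic buffer described above.
    """
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)                  # (N, N) GEMM: O(N^2 d) MACs
    scores -= scores.max(axis=-1, keepdims=True)     # global row reduction (numerical stability)
    weights = np.exp(scores)                         # elementwise exp: poor MAC-array utilization
    weights /= weights.sum(axis=-1, keepdims=True)   # second global row reduction
    return weights @ V                               # (N, N) x (N, d) GEMM: another O(N^2 d)
```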
2. What “asymmetric kernel SVD” means in hardware terms
Key idea
Instead of explicitly computing the N×N kernel (attention) matrix:

softmax(QKᵀ / √d), which has shape N×N

We factor the interaction into low-rank projections of the query/key feature maps:

e(x) = W_eᵀ φ_q(x),   r(x) = W_rᵀ φ_k(x)

Where:
- φ_q(x), φ_k(x) are p-dimensional query/key feature maps of token x
- W_e, W_r (shape p×s) are learned projection matrices
- Rank s ≪ N
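For concreteness, this is the singular value expansion being exploited, written as a sketch (σ_m, u_m, v_m denote the singular values and left/right singular functions of the asymmetric attention kernel; the notation is illustrative, not taken from the text above):

```latex
% Rank-s truncation of the asymmetric attention kernel
\kappa(x_i, x_j) \;\approx\; \sum_{m=1}^{s} \sigma_m \, u_m(x_i) \, v_m(x_j)
% In the primal view, these singular directions are realized as projections:
% e(x) = W_e^\top \phi_q(x), \qquad r(x) = W_r^\top \phi_k(x)
```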
3. Hardware translation of Primal-Attention
Replace this (vanilla attention): form the N×N score matrix, softmax it, then multiply by V.
With this (Primal-Attention): project query/key feature maps onto s learned directions, producing only N×s score tensors (a minimal sketch follows the list below).
What changed?
- No N×N matrix
- Only linear projections
- Only small intermediate tensors
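A minimal NumPy sketch of this replacement, under stated assumptions: φ_q/φ_k are taken to be simple ReLU feature maps, We/Wr stand in for the learned rank-s projections, and the output head (concatenating the two score tensors) is an illustrative choice rather than the paper's exact formulation:

```python
import numpy as np

def primal_attention_sketch(X, Wq, Wk, We, Wr):
    """Primal-Attention-style datapath (sketch, not reference code).

    X:      (N, d) input tokens
    Wq, Wk: (d, p) feature-map projections for the query / key sides
    We, Wr: (p, s) learned low-rank projection matrices, rank s << N
    Returns (N, 2*s) projection scores; no N x N buffer is ever formed.
    """
    phi_q = np.maximum(X @ Wq, 0.0)          # (N, p) query-side feature map (ReLU assumed)
    phi_k = np.maximum(X @ Wk, 0.0)          # (N, p) key-side feature map
    e = phi_q @ We                           # (N, s) scores on the learned left directions
    r = phi_k @ Wr                           # (N, s) scores on the learned right directions
    return np.concatenate([e, r], axis=-1)   # small, fixed-width output
```

Every operation here is a fixed-shape dense GEMM, so the block maps onto the same MAC arrays as an ordinary MLP layer.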
4. What computation units are actually doing
Compute pattern
All operations reduce to:
- Dense GEMMs
- Vector inner products
- Streaming reductions
Which means:
- ✔ Systolic arrays stay busy
- ✔ No softmax unit required
- ✔ No quadratic buffers
- ✔ Fully pipelinable (see the tiled streaming sketch below)
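As a sketch of what "streaming" and "pipelinable" mean here, the same computation can run over fixed-size tiles of tokens (the tile size, ReLU feature maps, and weight names are assumptions carried over from the earlier sketch):

```python
import numpy as np

def primal_attention_tiled(X, Wq, Wk, We, Wr, tile=128):
    """Process the sequence in fixed-size tiles.

    On-chip footprint is O(tile * p), independent of N, and every step is a
    fixed-shape GEMM, so successive tiles can be pipelined through one MAC array.
    """
    outs = []
    for start in range(0, X.shape[0], tile):
        x = X[start:start + tile]            # stream one tile of tokens in
        phi_q = np.maximum(x @ Wq, 0.0)      # (tile, p)
        phi_k = np.maximum(x @ Wk, 0.0)      # (tile, p)
        outs.append(np.concatenate([phi_q @ We, phi_k @ Wr], axis=-1))
    return np.concatenate(outs, axis=0)      # (N, 2*s), produced tile by tile
```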
5. Complexity in hardware terms
| Aspect | Vanilla Attention | Primal-Attention |
|---|---|---|
| MACs | O(N²d) | O(N·p·s) |
| SRAM usage | O(N²) | O(N·s) |
| DRAM traffic | Very high | Linear |
| Latency scaling | Quadratic | Linear |
| Pipelining | Difficult | Easy |
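A rough back-of-the-envelope check of the MAC row, with sizes chosen purely for illustration (N, d, p, s below are assumptions, not values from any paper):

```python
# Illustrative sizes only.
N, d = 8192, 64      # sequence length, head dimension
p, s = 256, 64       # assumed feature dimension and rank

vanilla_macs = 2 * N * N * d            # QK^T GEMM + (softmax output) @ V GEMM
primal_macs  = N * 2 * (d * p + p * s)  # q/k feature maps + low-rank projections

print(vanilla_macs / primal_macs)       # ~16x fewer MACs here; the gap grows linearly with N
```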
This is the difference between “GPU-only” and “edge-deployable.”
6. What “objective reaches zero under KKT” means in silicon
Translated to hardware language:
You don’t need to explicitly compute SVD.
Instead:
- Training learns projections that implicitly satisfy optimality
- Inference only runs simple linear layers
- No eigen-decomposition
- No iterative solvers
- No dynamic control flow
➡️ Static, compile-time–friendly graphs
This is exactly what hardware wants.
7. Why accelerators love this
Compared to standard attention
Primal-Attention:
- Looks like MLP layers
- Uses fixed-shape GEMMs
- Has predictable memory access
- Scales linearly with sequence length
This aligns perfectly with:
- TPUs
- NPUs
- Edge accelerators
- FPGA pipelines
8. Intuition in hardware language
Self-attention is usually “build a giant interaction matrix and then use it.”
Primal-Attention is “learn a small set of interaction directions and project everything onto them.”
Hardware translation:
- From global quadratic interaction
- To local linear projections + reductions
9. Why this matters long-term
This reframing means:
- Attention no longer requires special hardware
- Transformers become streamable
- Long-context models stop being memory-bound
- Kernel theory becomes hardware optimization theory
In short:
Primal-Attention turns attention from a memory problem into a compute problem — and hardware is very good at compute.
Possible follow-ups:
- Sketch a datapath diagram
- Compare this to FlashAttention in hardware terms
- Explain how this maps onto systolic arrays
- Show how it enables true streaming Transformers