Monday, December 29, 2025

Hardware of Primal Attention

Math of Primal Attention

Below is a hardware-level translation of the Math of Primal Attention write-up: what self-attention and Primal-Attention really mean once you build them in silicon.

I’ll deliberately avoid ML jargon where possible and instead describe datapaths, memory traffic, and compute units.


1. What vanilla self-attention looks like in hardware

Mathematical view

Attention(Q, K, V) = softmax(QKᵀ) V

Hardware reality

This explodes into three very expensive operations:

  1. QKᵀ matrix multiply

    • Shape: (N × d) × (d × N)

    • Cost: O(N²d) MACs

    • Requires materializing or streaming an N×N matrix

  2. Softmax

    • Exponentials + reductions

    • Poor accelerator utilization

    • Requires global normalization

  3. Multiply by V

    • Another large matrix multiply

Hardware pain points

  • ❌ Quadratic memory bandwidth (N²)

  • ❌ Quadratic SRAM/DRAM pressure

  • ❌ Poor data reuse

  • ❌ Latency scales with sequence length

  • ❌ Difficult to pipeline

This is why attention dominates area, power, and latency on AI accelerators.
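To make the cost concrete, here is a minimal NumPy sketch of the three steps above. The sizes (N = 4096, d = 64) are illustrative assumptions; the point is the N × N intermediate that the datapath is forced to carry.

```python
# A minimal NumPy sketch of the three steps above. Sizes are illustrative
# (N = 4096 tokens, head dim d = 64); the point is the N x N intermediate.
import numpy as np

N, d = 4096, 64
Q = np.random.randn(N, d).astype(np.float32)
K = np.random.randn(N, d).astype(np.float32)
V = np.random.randn(N, d).astype(np.float32)

# Step 1: QK^T -- O(N^2 * d) MACs, materializes an N x N score matrix.
scores = Q @ K.T                     # (4096, 4096) in fp32 = 64 MiB of traffic

# Step 2: softmax -- exponentials plus a global reduction over every row.
scores -= scores.max(axis=1, keepdims=True)    # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

# Step 3: multiply by V -- another O(N^2 * d) GEMM over the full score matrix.
out = weights @ V                    # (4096, 64)
```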


2. What “asymmetric kernel SVD” means in hardware terms

Key idea

Instead of explicitly computing:

QKᵀ

We factor the interaction into low-rank projections:

QKᵀ ≈ (QW_Q)(KW_K)ᵀ

Where:

    • W_Q, W_K are learned projection matrices

    • Rank s ≪ N
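A shape-and-cost sketch of that factorization is below. W_Q and W_K are random stand-ins here (a real model learns them), so this shows only the datapath and tensor sizes, not a faithful approximation of QKᵀ.

```python
# Shape-and-cost sketch of the factorization. W_Q, W_K are random stand-ins
# for the learned projections, so this shows datapath and tensor sizes only.
import numpy as np

N, d, s = 4096, 64, 32               # sequence length, head dim, rank (s << N)
Q   = np.random.randn(N, d).astype(np.float32)
K   = np.random.randn(N, d).astype(np.float32)
W_Q = np.random.randn(d, s).astype(np.float32)
W_K = np.random.randn(d, s).astype(np.float32)

Q_proj = Q @ W_Q                     # (N, s) -- a skinny GEMM, O(N*d*s) MACs
K_proj = K @ W_K                     # (N, s) -- same cost

# The N x N interaction is now only implicit: it equals Q_proj @ K_proj.T,
# but nothing downstream ever has to materialize that matrix.
```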


3. Hardware translation of Primal-Attention

Replace this (vanilla attention)

Q ──┐
    ├── massive GEMM ── softmax ── GEMM ── output
K ──┘

With this (Primal-Attention)

Q ── GEMM ──► Q_proj ──┐
                       ├── small GEMM ──► output
K ── GEMM ──► K_proj ──┘
V ── GEMM ─────────────────────────────┘

What changed?

  • No N×N matrix

  • Only linear projections

  • Only small intermediate tensors
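One way to see why no N × N tensor is ever needed: once the softmax is gone (or absorbed into the learned projections, as assumed here), associativity lets you contract K_proj with V first. The sketch below shows that reordered "small GEMM" path; it is an illustration of the diagram, not the paper's exact formulation.

```python
# Reordered "small GEMM" path, assuming the softmax has been absorbed into
# the learned projections (no explicit normalization step).
import numpy as np

N, d, s = 4096, 64, 32
Q_proj = np.random.randn(N, s).astype(np.float32)   # from Q @ W_Q
K_proj = np.random.randn(N, s).astype(np.float32)   # from K @ W_K
V      = np.random.randn(N, d).astype(np.float32)

context = K_proj.T @ V               # (s, d): tiny, easily held in on-chip SRAM
out     = Q_proj @ context           # (N, d): O(N*s*d) MACs

# Identical to (Q_proj @ K_proj.T) @ V, but the N x N tensor never exists.
```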


4. What computation units are actually doing

Compute pattern

All operations reduce to:

  • Dense GEMMs

  • Vector inner products

  • Streaming reductions

Which means:

  • ✔ Systolic arrays stay busy

  • ✔ No softmax unit required

  • ✔ No quadratic buffers

  • ✔ Fully pipelinable
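Here is a sketch of what "streaming reductions" and "fully pipelinable" mean in practice: tokens arrive in fixed-size chunks, the s × d context is a running accumulation, and every GEMM in the loop has the same fixed shape. The chunk size of 256 is an arbitrary illustrative choice.

```python
# Streaming version of the same datapath: tokens arrive in fixed-size chunks,
# the s x d context is a running reduction, and every GEMM in the loop has
# the same shape. The chunk size of 256 is an arbitrary illustrative choice.
import numpy as np

N, d, s, chunk = 4096, 64, 32, 256
Q_proj = np.random.randn(N, s).astype(np.float32)
K_proj = np.random.randn(N, s).astype(np.float32)
V      = np.random.randn(N, d).astype(np.float32)

context = np.zeros((s, d), dtype=np.float32)
for i in range(0, N, chunk):                         # streaming reduction over K/V
    context += K_proj[i:i+chunk].T @ V[i:i+chunk]    # fixed (s, chunk) x (chunk, d) GEMM

out = np.empty((N, d), dtype=np.float32)
for i in range(0, N, chunk):                         # second pass, fully pipelinable
    out[i:i+chunk] = Q_proj[i:i+chunk] @ context     # fixed (chunk, s) x (s, d) GEMM
```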


5. Complexity in hardware terms

Aspect            Vanilla Attention   Primal-Attention
MACs              O(N²d)              O(N·p·s)
SRAM usage        O(N²)               O(N·s)
DRAM traffic      Very high           Linear
Latency scaling   Quadratic           Linear
Pipelining        Difficult           Easy

This is the difference between “GPU-only” and “edge-deployable.”
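Some back-of-the-envelope numbers behind the table, using illustrative sizes (N = 8192, d = 64, s = 32) and using the head dimension d as a stand-in for the projected width p:

```python
# Back-of-the-envelope numbers for the table above.
# Illustrative sizes: N = 8192 tokens, head dim d = 64, rank s = 32, fp32.
N, d, s = 8192, 64, 32

vanilla_macs = 2 * N * N * d                      # QK^T plus the V multiply: ~8.6e9 MACs
primal_macs  = 2 * N * d * s + 2 * N * s * d      # two projections + reordered combine: ~6.7e7 MACs

vanilla_sram = N * N * 4 / 2**20                  # N x N scores in fp32: 256 MiB
primal_sram  = (2 * N * s + s * d) * 4 / 2**20    # Q_proj, K_proj and the s x d context: ~2 MiB

print(f"MACs: {vanilla_macs:.1e} vs {primal_macs:.1e}  ({vanilla_macs / primal_macs:.0f}x)")
print(f"SRAM: {vanilla_sram:.0f} MiB vs {primal_sram:.2f} MiB")
```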


6. What “objective reaches zero under KKT” means in silicon

Translated to hardware language:

You don’t need to explicitly compute an SVD.

Instead:

  • Training learns projections that implicitly satisfy optimality

  • Inference only runs simple linear layers

  • No eigen-decomposition

  • No iterative solvers

  • No dynamic control flow

➡️ Static, compile-time–friendly graphs

This is exactly what hardware wants.
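To make the contrast explicit, the sketch below shows what this result lets you skip (an explicit decomposition of the N × N kernel matrix) versus what inference actually runs (two fixed-shape linear layers). The weights here are random stand-ins for trained projections.

```python
# What "no explicit SVD" means for the inference graph. The weights here
# are random stand-ins for trained projections; shapes are illustrative.
import numpy as np

N, d, s = 1024, 64, 32
Q = np.random.randn(N, d).astype(np.float32)
K = np.random.randn(N, d).astype(np.float32)

# What you do NOT run in silicon: a decomposition of the N x N kernel matrix
# (roughly O(N^3), iterative, awkward to pipeline):
# U, S, Vt = np.linalg.svd(Q @ K.T)

# What you DO run: two fixed-shape linear layers, frozen after training.
W_Q = np.random.randn(d, s).astype(np.float32)
W_K = np.random.randn(d, s).astype(np.float32)
Q_proj, K_proj = Q @ W_Q, K @ W_K    # static shapes, no branches, no solver loops
```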


7. Why accelerators love this

Compared to standard attention

Primal-Attention:

  • Looks like MLP layers

  • Uses fixed-shape GEMMs

  • Has predictable memory access

  • Scales linearly with sequence length

This aligns perfectly with:

  • TPUs

  • NPUs

  • Edge accelerators

  • FPGA pipelines
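As a rough illustration of why the fit is good: every GEMM in the Primal-Attention path decomposes into the same fixed-shape tile kernel, which is the unit of work a systolic array or NPU MAC grid executes. Tile and tensor sizes below are illustrative assumptions, not tuned numbers.

```python
# Every GEMM in the Primal-Attention path decomposes into one fixed tile
# kernel: a (TM x TK) x (TK x TN) block product, the unit of work a systolic
# array executes. Tile sizes and tensor sizes below are illustrative.
import numpy as np

TM, TN, TK = 128, 32, 64                     # tile sizes matched to the MAC array
N, d, s = 1024, 64, 32
Q      = np.random.randn(N, d).astype(np.float32)
W_Q    = np.random.randn(d, s).astype(np.float32)
Q_proj = np.zeros((N, s), dtype=np.float32)

for i in range(0, N, TM):                    # output-stationary tiling
    for j in range(0, s, TN):
        for k in range(0, d, TK):            # accumulate over the contraction dim
            Q_proj[i:i+TM, j:j+TN] += Q[i:i+TM, k:k+TK] @ W_Q[k:k+TK, j:j+TN]

assert np.allclose(Q_proj, Q @ W_Q, atol=1e-3)   # same result as one big GEMM
```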


8. Intuition in hardware language

Self-attention is usually “build a giant interaction matrix and then use it.”

Primal-Attention is “learn a small set of interaction directions and project everything onto them.”

Hardware translation:

  • From global quadratic interaction

  • To local linear projections + reductions


9. Why this matters long-term

This reframing means:

  • Attention no longer requires special hardware

  • Transformers become streamable

  • Long-context models stop being memory-bound

  • Kernel theory becomes hardware optimization theory

In short:

Primal-Attention turns attention from a memory problem into a compute problem — and hardware is very good at compute.



Possible follow-ups for a later post:

  • Sketch a datapath diagram

  • Compare this to FlashAttention in hardware terms

  • Explain how this maps onto systolic arrays

  • Or show how it enables true streaming Transformers
