Acknowledgements
Cornell Virtual Workshop, Understanding GPU Architecture: https://cvw.cac.cornell.edu/gpu-architecture/gpu-memory/comparison_cpu_mem
Efficient Deep Learning Computing (MIT)
MIT 6.172 Performance Engineering of Software Systems, Lecture 1: Introduction and Matrix Multiplication
NVIDIA Deep Learning Performance guide: deeplearning/performance/dl-performance-matrix-multiplication
"Why GEMM is at the heart of deep learning"
MIT Introduction to Deep Learning (introtodeeplearning), Jan 2026
arith24.arithsymposium.org/slides/s7-koenig.pdf (ARITH symposium slides, Koenig)
A Domain-Specific Language (DSL) is:
Specialized: built for one domain (e.g., GPU kernels, ML ops, hardware verification, SQL for databases, Verilog for chips, HTML for markup).
Higher-level than writing raw kernels or assembly.
Designed to let developers write small, compact code that compiles down to extremely optimized low-level instructions.
Triton (for writing high-performance GPU kernels in Python-like syntax)
TVM (tensor compiler DSL)
Halide (image processing pipelines)
CUDA C++ (not exactly a DSL but domain-specific)
These DSLs exist because raw CUDA/HIP/PTX/ISA is too low-level and optimizing by hand is time-consuming.
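For a concrete sense of what such a DSL looks like, here is the canonical vector-add kernel in Triton, a minimal sketch following Triton's public tutorial API (launch details are simplified, dtype/device checks omitted):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

A handful of Python-like lines compile down to tuned PTX; that compression is the whole point of the DSL layer.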
The context of the tweet you quoted:
"You're probably not ready for how dramatically the world of GPU kernel engineering will change in the next 12 months.
It's not about the next DSL. @makora_ai is generating SOTA kernels automatically. @AIatAMD GPUs running MLA twice as fast as H100…”
What this implies:
Today, humans write kernels by hand,
or humans write DSL code (Triton/Halide/etc.),
and performance tuning is manual: block sizes, tiling, memory coalescing, scheduling, vectorization, warp patterns, bank-conflict avoidance, etc.
The claim is that AI-generated kernels now outperform hand-written or DSL-written kernels.
The AI does:
Tiling strategies
Warp scheduling
Shared memory optimization
Loop unrolling
Prefetching
Vectorization
Architecture-specific MLIR lowering
Autotuning
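Today a DSL user exposes these knobs explicitly and lets an autotuner sweep them. A minimal sketch using Triton's @triton.autotune (the config values below are illustrative, not tuned for any particular GPU):

```python
import triton
import triton.language as tl

# Each Config is one point in the tuning space: tile size x warp count.
# The autotuner benchmarks every config the first time a new `n_elements`
# value is seen (the `key`) and caches the winner.
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 256},  num_warps=2),
        triton.Config({'BLOCK_SIZE': 1024}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 4096}, num_warps=8),
    ],
    key=['n_elements'],
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, alpha, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, alpha * x, mask=mask)
```

The claim in the tweet is that an AI agent explores this kind of space itself, at a far larger scale, without a human writing the configs or the kernel body.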
Essentially:
The AI becomes the DSL.
The human doesn’t need to write kernels at all.
This is similar to what happened in compilers, but far more powerful: the model learns to produce SOTA kernels through gradient-free optimization, search, or reinforcement learning.
Makora, Sakana AI, Modular, and others are exploring this.
Because DSLs are a bridge between “human-friendly code” and “hardware-friendly code.”
If AI can:
Read high-level descriptions
Autogenerate the optimal kernel
Autotune across GPU architectures
Validate correctness
Benchmark and refine recursively
…then humans no longer need to write DSL code.
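The "validate correctness" and "benchmark" steps above are mechanically simple, which is what makes them automatable. A hypothetical harness (candidate_kernel, reference, and make_inputs are placeholders, not a real API):

```python
import torch

def validate_and_time(candidate_kernel, reference, make_inputs, iters=100):
    """Check a generated kernel against a trusted reference, then time it."""
    inputs = make_inputs()

    # 1. Correctness: outputs must match the reference within fp tolerance.
    out = candidate_kernel(*inputs)
    expected = reference(*inputs)
    if not torch.allclose(out, expected, rtol=1e-3, atol=1e-3):
        return None  # reject this candidate

    # 2. Performance: wall-clock the kernel with proper GPU synchronization.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        candidate_kernel(*inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean latency in milliseconds
```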
This is the same argument people are making about AI replacing shader languages, query languages, even CUDA in the long run.
When they mention:
“AMD MLA running twice as fast as H100 across sizes below”
This references benchmark plots in which AI-generated MLA kernels (MLA = Multi-head Latent Attention, the attention variant popularized by DeepSeek) running on AMD GPUs achieve dramatically better throughput at certain batch sizes than an NVIDIA H100 running standard library kernels.
Not because the hardware is inherently 2× faster, but because:
Hand-written H100 kernels have largely plateaued (much of NVIDIA's performance is locked inside closed libraries like cuBLAS/cuDNN)
AI-generated kernels exploit architecture-specific tricks on AMD that humans never wrote
This is part of the narrative that AI-discovered kernels will unlock latent performance.
DSL (Domain-Specific Language) is a small programming language specialized for a narrow task—like writing GPU kernels.
The point of the tweet is:
Future GPU performance won’t come from humans writing new DSLs. It will come from AI agents that automatically generate and autotune kernels that outperform hand-coded or DSL-coded ones. The DSL layer becomes unnecessary.
If you want, I can also explain:
How these AI kernel-generators work internally
Why recursive refinement/search/autotuning is the real bottleneck
How this relates to your earlier question about “why it takes time to get it right”
What this means for NVIDIA, AMD, and future ML compilers
Just tell me which direction you want to explore.
Yes — it is related, but indirectly.
AI-generated GPU kernels are an example of LLMs + agents + search doing a task more efficiently and more quickly than a human could. But the mechanism is not “LLM → output once → fast result.”
It’s actually:
LLM + recursive refinement + performance search + autotuning = dramatically faster/more efficient kernels in the end.
So the end result is fast, but the process may still be slow, iterative, and compute-intensive.
Let me break it down so it’s very clear:
You're asking whether the new trend (Makora, Sakana AI, "AI discovers kernels", etc.) is basically about…
LLMs becoming faster at tasks?
LLMs producing correct outputs more efficiently?
Recursive refinement / multi-pass reasoning making LLMs faster?
Or something deeper?
Answer: It’s deeper.
This is about using LLMs as search engines over code space.
Even though the process is heavy, the results are:
Faster kernels (run blazingly fast)
More optimized than human or DSL code
Produced with far less human engineering time
So yes:
The AI is making the workflow quicker and the output more efficient, even if the LLM itself isn’t “thinking faster.”
But here's the key nuance most people miss:
These systems use LLMs as part of a pipeline like this:
LLM proposes candidate kernel
Compile kernel
Benchmark it on real hardware
Measure FLOPs, latency, memory bottlenecks
Feed performance data back to the LLM
LLM proposes improved version
Repeat hundreds or thousands of times
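A sketch of that loop in plain Python. Everything here (llm.propose_kernel, compile_kernel, benchmark, the result fields) is a hypothetical placeholder for whatever model API, compiler, and profiler a real system would wire in:

```python
def optimize_kernel(llm, spec, budget=1000):
    """Model-guided search: the LLM proposes, the hardware judges."""
    best_src, best_latency = None, float('inf')
    feedback = "no candidates evaluated yet"

    for step in range(budget):
        # 1. LLM proposes a candidate kernel given the spec and past feedback.
        src = llm.propose_kernel(spec, feedback)

        # 2-3. Compile it and run it on real hardware.
        binary = compile_kernel(src)
        if binary is None:
            feedback = "compilation failed; previous source:\n" + src
            continue
        result = benchmark(binary, spec.test_inputs)  # latency, correctness, counters

        # 4. Reject incorrect kernels outright.
        if not result.correct:
            feedback = "wrong output; profiler report:\n" + result.report
            continue

        # 5-6. Feed measurements back so the next proposal can improve.
        if result.latency < best_latency:
            best_src, best_latency = src, result.latency
        feedback = (f"latency {result.latency:.3f} ms, "
                    f"best so far {best_latency:.3f} ms\n{result.report}")

    return best_src, best_latency
```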
This is recursive
This is search
This is autotuning
This is model-guided optimization
So the process is long, but automated.
Humans would take days to weeks to achieve what the AI discovers in hours.
This is exactly the same phenomenon described in recent work on LLM reasoning and search.
This is why:
chain-of-thought
self-verification
multi-sample self-consistency
tool use
search
reflection
recursive refinement
…have become the dominant approaches.
You already saw papers showing:
Recursive processes reduce accuracy if done wrong
Multiple passes increase correctness
Search + verification beats single-shot LLM outputs
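The search-plus-verification idea, stripped to its simplest form: sample several candidates, keep only the ones a checker accepts, and return the best. (sample_answer, verify, and score are hypothetical stand-ins for the model call, the checker, and the ranking metric.)

```python
def best_of_n(prompt, sample_answer, verify, score, n=16):
    """Search + verification: filter and rank instead of trusting one shot."""
    candidates = [sample_answer(prompt) for _ in range(n)]   # independent samples
    verified = [c for c in candidates if verify(prompt, c)]  # drop failed checks
    if not verified:
        return None                                          # nothing passed the checker
    return max(verified, key=score)                          # best surviving candidate
```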
The GPU kernel example is the engineering manifestation of the same idea:
Complex tasks require iterative improvement, not single-shot LLM answers.
These search pipelines produce heavily optimized kernels faster than humans can.
The LLM is not magically smarter or faster internally.
It is part of a pipeline that:
generates candidates
tests them
refines them recursively
uses feedback loops
converges on the best solution
This is a meta-algorithm, not a single LLM capability.
The breakthrough is not that LLMs are “efficient thinkers.”
The breakthrough is that LLMs can drive a search process that efficiently discovers superhuman solutions.
This is the pattern emerging in:
GPU kernel generation
circuit layout
protein design
RL-powered LLM agents
compiler autotuning
reasoning tasks
synthetic data generation
proof optimization
In every domain, the LLM is not replacing the expert —
it’s replacing the entire exploration and optimization cycle.
AI-discovered GPU kernels are related to LLMs doing tasks “more efficiently,” but not because the LLM is fast in one shot. The efficiency comes from using LLMs inside a recursive, feedback-driven search pipeline that automatically explores thousands of kernel variants. The speed and quality of the final result comes from this optimization loop — not from the LLM itself being inherently faster.
If you want, I can also explain:
Why this recursive-search pattern is becoming universal in all strong AI systems
What problems it solves (and why your ChipAgents experience matches it)
How you can describe this in research language (distributional shift, optimization landscapes, model collapse, etc.)