Acknowledgements
https://cvw.cac.cornell.edu/gpu-architecture/gpu-memory/comparison_cpu_mem
Efficient Deep Learning Computing
6-172-performance-engineering-of-software-systems lecture-1-introduction-and-matrix-multiplication
deeplearning/performance/dl-performance-matrix-multiplication
why-gemm-is-at-the-heart-of-deep-learning
MIT introtodeeplearning - Jan 2026
arith24.arithsymposium.org/slides/s7-koenig.pdf
A Domain-Specific Language (DSL) is:
Specialized: built for one domain (e.g., GPU kernels, ML ops, hardware verification, SQL for databases, Verilog for chips, HTML for markup).
Higher-level than writing raw kernels or assembly.
Designed to let developers write small, compact code that compiles down to extremely optimized low-level instructions.
Triton (for writing high-performance GPU kernels in Python-like syntax)
TVM (tensor compiler DSL)
Halide (image processing pipelines)
CUDA C++ (not exactly a DSL but domain-specific)
These DSLs exist because raw CUDA/HIP/PTX/ISA is too low-level and optimizing by hand is time-consuming.
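For a sense of that abstraction level, here is a minimal sketch of what Triton kernel code looks like (an illustrative vector add in the style of Triton's public tutorial API; not a kernel referenced in the tweet):

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements              # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

The notable part is what is absent: no explicit warps, shared memory, or PTX; the compiler handles the lowering.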
The context of the tweet you quoted:
"You're probably not ready for how dramatically the world of GPU kernel engineering will change in the next 12 months.
It's not about the next DSL. @makora_ai is generating SOTA kernels automatically. @AIatAMD GPUs running MLA twice as fast as H100…”
What this implies:
Humans write kernels by hand
Or humans write DSL code (Triton/Halide/etc.)
Performance tuning is manual: block sizes, tiling, memory coalescing, scheduling, vectorization, warp patterns, bank-conflict avoidance, etc.
AI-generated kernels outperform hand-written or DSL-written kernels
The AI does:
Tiling strategies
Warp scheduling
Shared memory optimization
Loop unrolling
Prefetching
Vectorization
Architecture-specific MLIR lowering
Autotuning
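These are the same knobs a human currently enumerates by hand, e.g. with Triton's autotuner. The sketch below shows an assumed search space for a hypothetical matmul kernel (the kernel body is omitted); it is illustrative only:

```python
import triton
import triton.language as tl

# Illustrative search space only; real kernels tune far more parameters.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 64,  "BLOCK_N": 64,  "BLOCK_K": 32}, num_warps=4, num_stages=2),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64,  "BLOCK_K": 32}, num_warps=8, num_stages=3),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64}, num_warps=8, num_stages=4),
    ],
    key=["M", "N", "K"],  # re-run the tuning sweep whenever the problem shape changes
)
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pass  # body omitted; the point is the enumerated tiling / warp / pipeline-stage space
```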
Essentially:
The AI becomes the DSL.
The human doesn’t need to write kernels at all.
This is similar to what happened in compilers, but far more powerful: the model learns to produce SOTA kernels through gradient-free optimization, search, or reinforcement.
Makora, Sakana AI, Modular, and others are exploring this.
Because DSLs are a bridge between “human-friendly code” and “hardware-friendly code.”
If AI can:
Read high-level descriptions
Autogenerate the optimal kernel
Autotune across GPU architectures
Validate correctness
Benchmark and refine recursively
…then humans no longer need to write DSL code.
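The "validate correctness" step, for instance, is usually just a numeric comparison against a trusted reference implementation. A minimal sketch (the argument names are placeholders, and the reference would come from e.g. a library op):

```python
import numpy as np

def validate(candidate_out: np.ndarray, reference_out: np.ndarray,
             rtol: float = 1e-3, atol: float = 1e-5) -> bool:
    """Accept a generated kernel only if its output matches a trusted reference op."""
    return (candidate_out.shape == reference_out.shape
            and np.allclose(candidate_out, reference_out, rtol=rtol, atol=atol))
```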
This is the same argument people are making about AI replacing shader languages, query languages, even CUDA in the long run.
When they mention:
“AMD MLA running twice as fast as H100 across sizes below”
This references benchmark graphs in which AI-generated Multi-head Latent Attention (MLA) kernels on AMD GPUs achieve dramatically higher throughput at certain batch sizes than an NVIDIA H100 running standard libraries.
Not because the hardware is inherently 2× faster, but because:
Handwritten/H100 kernels stagnate (NVIDIA keeps things locked behind cuBLAS/cuDNN)
AI-generated kernels exploit architecture-specific tricks on AMD that humans never wrote
This is part of the narrative that AI-discovered kernels will unlock latent performance.
DSL (Domain-Specific Language) is a small programming language specialized for a narrow task—like writing GPU kernels.
The point of the tweet is:
Future GPU performance won’t come from humans writing new DSLs. It will come from AI agents that automatically generate and autotune kernels that outperform hand-coded or DSL-coded ones. The DSL layer becomes unnecessary.
If you want, I can also explain:
How these AI kernel-generators work internally
Why recursive refinement/search/autotuning is the real bottleneck
How this relates to your earlier question about “why it takes time to get it right”
What this means for NVIDIA, AMD, and future ML compilers
Just tell me which direction you want to explore.
Yes — it is related, but indirectly.
AI-generated GPU kernels are an example of LLMs + agents + search doing a task more efficiently and more quickly than a human could. But the mechanism is not “LLM → output once → fast result.”
It’s actually:
LLM + recursive refinement + performance search + autotuning = dramatically faster/more efficient kernels in the end.
So the end result is fast, but the process may still be slow, iterative, and compute-intensive.
Let me break it down so it’s very clear:
You're asking whether the new trend (Makora, Sakana AI, "AI discovers kernels", etc.) is basically about…
LLMs becoming faster at tasks?
LLMs producing correct outputs more efficiently?
Recursive refinement / multi-pass reasoning making LLMs faster?
Or something deeper?
Answer: It’s deeper.
This is about using LLMs as search engines over code space.
Even though the process is heavy, the results are:
Faster kernels (run blazingly fast)
More optimized than human or DSL code
Produced with far less human engineering time
So yes:
The AI is making the workflow quicker and the output more efficient, even if the LLM itself isn’t “thinking faster.”
But here's the key nuance most people miss:
These systems use LLMs as part of a pipeline like this:
LLM proposes candidate kernel
Compile kernel
Benchmark it on real hardware
Measure FLOPs, latency, memory bottlenecks
Feed performance data back to the LLM
LLM proposes improved version
Repeat hundreds or thousands of times
This is recursive
This is search
This is autotuning
This is model-guided optimization
So the process is long, but automated.
Humans would take days→weeks to achieve what the AI discovers in hours.
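A skeleton of that loop, with the hard parts (llm_propose, compile_kernel, check_correctness, benchmark) left as assumed placeholder functions rather than any real API:

```python
def optimize_kernel(task_description, n_rounds=200):
    """Model-guided kernel search: propose -> compile -> benchmark -> feed results back."""
    best_src, best_latency = None, float("inf")
    feedback = ""  # grows with compile errors and timing measurements each round
    for _ in range(n_rounds):
        src = llm_propose(task_description, feedback)  # placeholder: ask the LLM for a candidate
        binary = compile_kernel(src)                   # placeholder: nvcc / triton / hipcc
        if binary is None or not check_correctness(binary):
            feedback = "last candidate failed to compile or produced wrong results"
            continue
        latency = benchmark(binary)                    # placeholder: timed runs on real hardware
        if latency < best_latency:
            best_src, best_latency = src, latency
        feedback = f"last latency {latency:.3f} ms; best so far {best_latency:.3f} ms"
    return best_src, best_latency
```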
This is exactly the same phenomenon described in several papers:
This is why:
chain-of-thought
self verification
multi-sample self-consistency
tool use
search
reflection
recursive refinement
…have become the dominant approaches.
You already saw papers showing:
Recursive processes reduce accuracy if done wrong
Multiple passes increase correctness
Search + verification beats single-shot LLM outputs
The GPU kernel example is the engineering manifestation of the same idea:
Complex tasks require iterative improvement, not single-shot LLM answers.
They produce insanely optimized kernels faster than humans.
The LLM is not magically smarter or faster internally.
It is part of a pipeline that:
generates candidates
tests them
refines them recursively
uses feedback loops
converges on the best solution
This is a meta-algorithm, not a single LLM capability.
The breakthrough is not that LLMs are “efficient thinkers.”
The breakthrough is that LLMs can drive a search process that efficiently discovers superhuman solutions.
This is the pattern emerging in:
GPU kernel generation
circuit layout
protein design
RL-powered LLM agents
compiler autotuning
reasoning tasks
synthetic data generation
proof optimization
In every domain, the LLM is not replacing the expert —
it’s replacing the entire exploration and optimization cycle.
AI-discovered GPU kernels are related to LLMs doing tasks “more efficiently,” but not because the LLM is fast in one shot. The efficiency comes from using LLMs inside a recursive, feedback-driven search pipeline that automatically explores thousands of kernel variants. The speed and quality of the final result comes from this optimization loop — not from the LLM itself being inherently faster.
If you want, I can also explain:
Why this recursive-search pattern is becoming universal in all strong AI systems
What problems it solves (and why your ChipAgents experience matches it)
How you can describe this in research language (distributional shift, optimization landscapes, model collapse, etc.)
(High-level explanation)
Traditional zk-SNARK proof generation is:
computationally heavy, especially for large circuits, and
typically done by a single prover.
In many real-world systems (blockchains, decentralized protocols, collaborative computations), no single party wants or is able to generate the entire proof alone.
Goal:
Build a protocol where multiple provers can collaboratively (and securely) generate a single zk-SNARK proof without trusting one another and with sublinear communication overhead.
This allows N provers to jointly produce a SNARK proof for a computation without sharing private witnesses.
Key properties:
Scales to many provers (scalable MPC-like structure).
Maintains zero-knowledge and soundness even if all but one of the provers are malicious.
Works with Groth16-style and Plonkish SNARKs depending on instantiation.
SNARK proof generation relies heavily on multi-scalar multiplications (MSMs) such as:
a1·G1 + a2·G2 + ⋯ + an·Gn
In collaborative settings, the scalars ai (derived from secret witnesses) may be additively shared among provers.
The paper introduces CME, a protocol where:
inputs to the MSM are secret-shared among many provers,
yet the final MSM result is produced without revealing the shares.
CME is the technical core enabling distributed proof generation to be efficient.
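A toy illustration of the underlying idea (not the paper's actual CME protocol): additive shares of the scalars can be multiplied against the public bases locally, and the partial results sum to the true MSM. Here the "group" is just integers modulo a prime, standing in for elliptic-curve points:

```python
import random

P = 2**61 - 1  # toy prime modulus, standing in for a group order

def share(scalar, n_provers):
    """Split a secret scalar into n additive shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n_provers - 1)]
    shares.append((scalar - sum(shares)) % P)
    return shares

def toy_msm(scalars, bases):
    """'MSM' in the toy group: sum_i a_i * G_i mod P."""
    return sum(a * g for a, g in zip(scalars, bases)) % P

secret_scalars = [random.randrange(P) for _ in range(4)]  # witness-derived scalars
bases = [random.randrange(P) for _ in range(4)]           # public bases G_i
n = 3                                                     # number of provers

shared = [share(a, n) for a in secret_scalars]            # each scalar split across the provers
partials = [toy_msm([s[k] for s in shared], bases) for k in range(n)]  # each prover's local MSM

# The partial MSMs recombine to the true MSM without any prover seeing the full scalars.
assert sum(partials) % P == toy_msm(secret_scalars, bases)
```

In the actual construction the bases are elliptic-curve points, and the protocol additionally handles zero-knowledge randomizers and malicious provers.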
The authors apply their scalable collaborative zk-SNARK construction to create a fully decentralized proving pool.
Use case:
A user wants a proof (e.g., for an L2 rollup, ZK-ML, or verification task), but offloads the computation to a network of distributed provers.
Properties:
User never reveals their secret witness.
No trusted coordinator.
Provers work collectively and are rewarded for their computation.
Robust even under adversarial provers.
This is different from traditional “proof outsourcing” because no single prover computes the whole proof, and no special trust assumptions are needed.
The client splits their witness w into additive shares:
w = w1 + w2 + ⋯ + wN
Each prover only receives its share wi.
Each prover computes:
their share of intermediate polynomial evaluations,
their contribution to constraint polynomials,
shares of randomizers to preserve zero-knowledge.
The provers collectively compute the large MSMs required for proof generation:
No single prover sees the full scalars.
Output is the correct MSM needed by the SNARK prover.
This step replaces the most expensive part of proof generation.
Using CME outputs and local shares, the protocol merges everything into a classical zk-SNARK proof format (e.g., Groth16).
The verifier sees a standard SNARK proof — they cannot tell it was produced collaboratively.
The paper demonstrates:
Near-linear scaling as the number of provers increases.
Dramatically reduced per-prover load.
Low communication relative to circuit size (sublinear, dominated by aggregated MSM communication).
Practical performance competitive with centralized provers for large circuits.
The construction gives:
Zero-knowledge: No prover learns anything about the witness.
Soundness: Even if all but one of the provers are malicious, they cannot fabricate a false proof.
Robustness: System continues functioning despite adversarial dropout (threshold variants supported).
For rollups (e.g., zk-rollup networks) where proof generation must be continuous and expensive.
Large neural-network proofs can be broken across many GPUs.
Users submit jobs; provers cooperatively generate proofs and get rewarded.
A user can outsource proof generation without trusting a single prover.
I can provide:
✅ A 1-paragraph summary
✅ A whiteboard-style conceptual explanation
✅ A point-by-point explanation of the math (CME, MSM, polynomial commitments)
✅ A slide-ready summary
✅ A comparison with Boojum, Plonky2, multi-prover Halo2, GKR-based MPC, etc.
✅ Pseudocode or diagrams for CME
"An activity that makes you lose track of time"
Goal: Immediately show the reader what it feels like to lose track of time.
Describe one powerful moment with drones (first flight, first crash/rebuild, a breakthrough, a sunrise flight, etc).
Use sensory detail: what they saw, heard, felt.
Show—not tell—that time disappeared.
What moment made you fall fully into the world of drones?
What did the drone do, and how did you respond?
How did you realize that time had passed without noticing?
Admissions readers get pulled into a scene, not a résumé.
Goal: Show how the interest took hold and grew naturally.
Briefly describe early fascination (2nd grade, first kits, early experiments).
Focus on curiosity, not accomplishments.
Show how drones became a part of life.
When did you realize this was more than a toy?
What did you start doing on your own (taking apart drones, watching videos, reading)?
How did your curiosity push you to explore more?
Georgia Tech values intrinsic motivation—passion without external pressure.
Goal: Show depth and seriousness (certifications, self-study, building, problem-solving).
Describe learning more complex ideas (aerodynamics, autonomous systems, GPS modules).
Mention certifications, but show what you learned from them.
Include one example of overcoming a problem or failure.
What did you teach yourself that required effort?
What challenge made you learn something important?
What was a moment you felt proud of your skill level?
Demonstrates discipline, growth, and engineering mindset.
Goal: Show how the passion matured into real impact.
Describe an internship, project, mapping work, or structured experience.
Explain the skills learned (planning flights, analyzing data, troubleshooting hardware).
Show how drones went from hobby → tool → future career interest.
What real task did you accomplish using drones?
How did your skills help someone else or solve a real problem?
How did this experience shift your understanding of drones?
Tech schools love when students apply knowledge, not just learn it.
Goal: Show growth into leadership, communication, and community contribution.
Explain how he helped others (workshops, younger students, clubs).
Highlight his ability to explain complex concepts simply.
Show that passion naturally evolves into mentoring.
What did teaching others teach you?
How did you feel watching someone you taught succeed?
How did teaching strengthen your own mastery?
Georgia Tech values collaborative learners who uplift peers.
Goal: Tie drones to personal identity + what makes him human.
Describe what drones mean to him.
Reflect on time disappearing, flow state, focus, resilience, joy.
Connect mindset from drones to how he solves problems or views the world.
Why do you lose track of time while flying or building drones?
What do drones reveal about how you think, learn, and persist?
What personal lessons or traits emerged from this passion?
Reflection shows awareness—critical for strong essays.
Goal: Connect drones → engineering → space → Georgia Tech.
Mention specific principles drones taught him (autonomy, navigation, flight dynamics).
Connect these skills to space exploration or aerospace engineering.
End with a forward-looking statement about what he hopes to do at Georgia Tech.
How does your drone experience prepare you for GT’s Space Program?
What skills or mindset will you bring to their labs/courses/teams?
What excites you about pushing flight beyond Earth?
Shows intentionality: “I didn’t just choose drones. I chose a future.”
Hook: Moment you lost track of time
Early curiosity & spark
Skill-building + challenge overcome
Real-world application (internship/project)
Teaching/mentoring
Reflection (why drones matter to you)
Georgia Tech connection + future vision
On the morning I first flew a drone, I wasn’t thinking about engineering or future careers. I was eight years old, standing in my backyard, staring at a tiny machine with blinking lights. When I pushed the throttle and watched it lift—shaking, wobbling, catching the sunlight—I felt something I didn’t have a word for then. I now know it was curiosity giving way to possibility. Before I realized it, the afternoon had vanished, and I had spent hours learning how to guide a fragile aircraft through the sky. Even today, drones are the one activity that makes time disappear for me.
That first moment of wonder slowly turned into something deeper. By the time I reached middle school, I wasn’t just flying drones—I was opening them, taking them apart, and putting them back together to understand why they worked. I spent weekends rebuilding motors, replacing propellers, and experimenting with flight controllers. The more I learned, the more the world expanded. Suddenly, I wasn’t playing with a toy; I was exploring aerodynamics, autonomous flight, and flight dynamics.
In high school, I decided to formalize this hobby and earned my Part 107 certification, which opened the door to more advanced flying and experimentation. I began building custom drones, incorporating GPS modules, stabilization algorithms, and telemetry systems. My understanding of drones shifted from hobbyist fascination to technical challenge: every malfunction became a puzzle, and every successful flight felt like a small engineering victory.
Last year, I completed an internship with a local drone-mapping startup. My job involved planning survey flights, troubleshooting hardware failures, and analyzing aerial data. For the first time, I saw drones not just as machines but as tools—capable of mapping farmland, assessing infrastructure, and responding to emergencies. Understanding their real-world impact only deepened my interest. I realized how much precise control, systems thinking, and problem-solving mattered—skills that also define aerospace engineering.
Teaching others came next. I started mentoring younger students through a weekend workshop at the community center. Teaching someone how to fly for the first time reminded me of my own early excitement, but it also challenged me to explain concepts like lift, thrust, and gyroscopic stability in simple terms. Watching others discover the same spark I felt at eight made me understand that learning doesn’t just happen through textbooks; it grows through shared curiosity.
Through drones, I have learned the joy of building something, testing it, breaking it, and starting again. I’ve learned how to solve problems when nothing seems to work. I’ve learned how to be patient when a flight controller refuses to calibrate or when a drone crashes on its first test. Most of all, I’ve learned how much I love the feeling of losing track of time while trying to make a machine stay steady in the sky.
What began as a childhood fascination has grown into a passion for flight, systems engineering, and exploration. The intersection of drones and space—autonomy, navigation, flight control, and design—is exactly where I want to build my future. Georgia Tech’s Space Program represents the kind of challenge that excites me: the chance to work on the next generation of flight, where the problems are complex, the solutions are imaginative, and the potential is limitless.
Drones taught me how to look up. Georgia Tech is where I hope to learn how to go farther.
optimizing GPU section 4
Section 4 is about features of the CUDA platform and runtime that you can leverage (memory models, execution control, asynchronous copies, dynamic parallelism, memory management, etc.).
CUDA tiles and easier porting of AI kernels
Writing optimized GPU kernels remains a critical bottleneck.
chatgpt - "The statement by Ion Stoica highlights a critical issue in the field of GPU programming, specifically in the context of optimizing GPU kernels for high performance. Let me break down the reasoning behind why writing optimized GPU kernels remains a critical bottleneck, and what ADRS and LLM-driven search contribute to solving this issue.
GPU kernels are pieces of code that run on a Graphics Processing Unit (GPU) and perform highly parallel computations. Writing these kernels efficiently is difficult because GPUs are designed to perform many operations at once, but optimizing the code to fully leverage the hardware is a non-trivial task.
Memory Access Patterns: GPUs have different memory architectures (e.g., shared memory, global memory, constant memory), and the way memory is accessed can have a huge impact on performance. Optimizing how data is loaded, stored, and accessed is critical, but it’s hard to balance between memory access latency and computation speed.
Parallelism and Load Balancing: Efficiently distributing work across the thousands of cores on modern GPUs while avoiding bottlenecks and ensuring load balancing is tricky. Writing these kernels requires deep knowledge of the hardware architecture.
Precision and Resource Constraints: Optimizing the use of floating-point precision, register usage, and thread synchronization adds another layer of complexity.
All these factors contribute to why writing high-performance GPU kernels is still a significant bottleneck in achieving optimal speed and efficiency in computational tasks.
ADRS (Automated Deep Reinforcement Search) is likely a framework designed to help automate the process of tuning and optimizing GPU kernel code. ADRS uses advanced techniques such as search algorithms (in this case, likely LLM-driven search) to explore possible kernel configurations and optimize them automatically.
LLM-driven search refers to using Large Language Models (LLMs) like GPT or other AI models to search through and generate optimized code snippets. By leveraging the power of AI, the ADRS framework can automate the time-consuming and error-prone process of kernel optimization.
The framework might involve machine learning models that learn the best practices for optimizing kernel code, based on prior kernel codebases or real-time feedback.
According to Stoica's post, the BitsEvolve framework, powered by ADRS, is able to achieve a 1.6x speedup over compiled baselines. This means that the framework, likely through its optimization algorithms, has been able to identify optimizations that improve performance by 60% compared to standard compilation without specific GPU kernel optimizations.
This is a significant gain in performance, especially considering the difficulty of manually tuning GPU kernels. It suggests that AI-powered search and optimization tools like ADRS can accelerate the development process and help overcome the barriers that make GPU kernel optimization so challenging.
The integration of LLMs (Large Language Models) for kernel optimization is a fascinating development. LLMs have been shown to be highly effective at tasks such as natural language understanding and code generation (e.g., writing Python, C++, or CUDA code).
Search for optimal code: By training LLMs on vast codebases, these models can learn patterns of optimized code and leverage this knowledge to suggest improvements or automatically generate code that performs well under specific conditions.
Intelligent search: Rather than relying on brute-force testing of every potential kernel configuration, LLMs can intelligently search for the most efficient kernel patterns, potentially identifying performance bottlenecks and suggesting optimizations that human developers might not think of.
The reason writing optimized GPU kernels remains a critical bottleneck is due to the complexity and specialized knowledge required to fully leverage the capabilities of modern GPUs. Optimizing for things like parallelism, memory access, and resource constraints is a difficult and error-prone task.
However, frameworks like ADRS, using LLM-driven search to automatically generate optimized code, represent a potential breakthrough in overcoming this bottleneck. By automating kernel optimization, these AI-driven solutions can save developers significant time and effort, while potentially achieving superior performance (e.g., the 1.6x speedup mentioned).
This shift towards AI-assisted development tools is a reflection of broader trends in machine learning and AI-powered software engineering—where AI is helping developers not just with writing code but with optimizing and improving code in ways that were previously time-consuming or beyond the capability of traditional methods."