Tuesday, July 15, 2025

Missing learning loop for LLMs

 Notes on Andrej Karpathy's take on RL, summarized with a bit of ChatGPT magic. How humans actually learn points to the need for this kind of elaboration.

1. RL is powerful, but not the full story

  • RL is gaining traction and will continue to generate useful results, largely because it is more leveraged than traditional supervised fine-tuning (SFT). But it has limitations, especially on long-horizon tasks (tasks that take a long time or many steps).

  • The standard RL approach—rewarding or punishing actions based on final scalar feedback—is very lossy, especially when the task is long and complex.

“You're really going to do all that work just to learn a single scalar outcome at the very end?”
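
To make that lossiness concrete, here is a minimal toy sketch (my own illustration, not Andrej's code; Step and reinforce_style_update are made-up names): a vanilla policy-gradient style update broadcasts one terminal scalar across every decision in a long rollout.

```python
from dataclasses import dataclass

@dataclass
class Step:
    state: str
    action: str

def reinforce_style_update(rollout: list[Step], final_reward: float) -> list[float]:
    """Use the same terminal scalar as the learning signal for every step.

    This mirrors the vanilla policy-gradient setup: hundreds of decisions,
    one number of feedback, so good and bad moves inside a winning
    trajectory all get reinforced equally.
    """
    return [final_reward for _ in rollout]

rollout = [Step(state=f"s{i}", action=f"a{i}") for i in range(500)]  # long-horizon task
per_step_signal = reinforce_style_update(rollout, final_reward=1.0)
print(f"{len(rollout)} decisions supervised by a single scalar: {per_step_signal[0]}")
```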


2. Human learning isn’t like that

  • Humans don't just get a reward at the end; they reflect. After doing something, we think:

    • What went well?

    • What didn’t?

    • What could I do differently?

  • These explicit lessons are stored consciously, sometimes even verbalized ("next time, try X"), and over time they become intuitive or second nature.

  • LLMs are missing this kind of reflective, in-context learning + distillation pipeline.

“There's significantly more bits of supervision we extract per rollout via a review/reflect stage.”
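
As a thought experiment, here is what such a review/reflect stage could look like for an LLM. This is a hedged sketch, not anything Andrej published: call_llm is a hypothetical stand-in for a real model API, and the counting task echoes the example quoted in the next section.

```python
REVIEW_PROMPT = """You just attempted the task below several times.
For each attempt you are given the transcript and the outcome.
Write down, as explicit bullet points, what went well, what didn't,
and what to do differently next time."""

def call_llm(prompt: str) -> str:
    # Placeholder: stands in for a real LLM API call.
    return ("- Count the characters one by one instead of estimating.\n"
            "- State the running count explicitly before answering.")

def reflect(task: str, attempts: list[tuple[str, str]]) -> str:
    """Turn rollouts + outcomes into explicit, verbalized lessons (many bits, not one scalar)."""
    transcript = "\n\n".join(
        f"Attempt {i + 1}:\n{trace}\nOutcome: {outcome}"
        for i, (trace, outcome) in enumerate(attempts)
    )
    return call_llm(f"{REVIEW_PROMPT}\n\nTask: {task}\n\n{transcript}")

lessons = reflect(
    task="Count the 'r' characters in 'strawberry'.",
    attempts=[("...model transcript...", "answered 2; the correct answer is 3")],
)
print(lessons)
```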


3. He proposes an example paradigm

  • Run a few episodes of task attempts (rollouts).

  • Feed those attempts and outcomes into a model.

  • Use a prompt to analyze them and extract explicit lessons.

  • Save those lessons in a "lessons database" and surface them later, for example by adding them to the system prompt.

  • Over time, these lessons can be distilled into the weights, much as sleep seems to do for human learning. (A rough code sketch of this loop follows below.)

“This string is the 'lesson', explicitly instructing the model how to complete the counting task.”
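
Putting the pieces together, here is a rough sketch of the whole loop under the same assumptions as above (call_llm, the judging step, and distill are all hypothetical placeholders): run rollouts with the current lesson-augmented system prompt, reflect, append to the lessons database, and occasionally distill into weights.

```python
LESSONS_DB: list[str] = []          # the explicit, string-valued lessons
BASE_SYSTEM_PROMPT = "You are a careful assistant."

def call_llm(prompt: str, system: str = "") -> str:
    # Placeholder standing in for a real model call.
    return "..."

def system_prompt() -> str:
    """Inject accumulated lessons into the system prompt."""
    if not LESSONS_DB:
        return BASE_SYSTEM_PROMPT
    return BASE_SYSTEM_PROMPT + "\n\nLessons from past attempts:\n" + "\n".join(LESSONS_DB)

def run_episode(task: str) -> tuple[str, str]:
    """One rollout: attempt the task with the current (lesson-augmented) system prompt."""
    trace = call_llm(task, system=system_prompt())
    outcome = call_llm(f"Judge this attempt at '{task}':\n{trace}")  # or any external checker
    return trace, outcome

def reflect(task: str, attempts: list[tuple[str, str]]) -> str:
    """Review/reflect stage: extract explicit lessons from rollouts and outcomes."""
    transcript = "\n\n".join(f"{trace}\nOutcome: {outcome}" for trace, outcome in attempts)
    return call_llm(f"Review these attempts at '{task}' and write explicit lessons:\n{transcript}")

def distill() -> None:
    # Placeholder: periodically fine-tune on lesson-augmented transcripts so the
    # behavior moves into the weights ("like sleep") and old lessons can be pruned.
    pass

for _ in range(3):
    task = "Count the 'r' characters in 'strawberry'."
    attempts = [run_episode(task) for _ in range(4)]
    LESSONS_DB.append(reflect(task, attempts))
distill()
```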


4. This doesn't yet exist in traditional RL domains

  • In environments like Atari or robotics, there are no language models, no system prompts, no reflection. That limits how you can apply this idea there.

  • But for LLMs, we can do this—and we should, because they support meta-learning, in-context memory, and text-based abstraction.


5. TL;DR takeaway

  • RL > SFT, but both are still missing something.

  • Reflection + explicit lessons → distillation → intuitive behavior is a missing learning loop.

  • This could unlock new S-curves of progress unique to LLM architectures, not shared by traditional RL agents in games or robotics.


🧠 Meta-message

Andrej is advocating for new learning paradigms that:

  • Better mimic how humans learn.

  • Use the unique affordances of LLMs (text, reflection, context windows).

  • Go beyond vanilla RL and SFT, especially as tasks get longer and more complex.

He's hinting that the next big breakthroughs might not just come from scaling, but from rethinking how we teach models to reflect, remember, and internalize lessons.

