RL Algorithms to Live By 2: Curiosity

anruigu.github.io, 06/01/2026

Interlude: Claude as Co-writer

Experimenting with Claude as a co-writer made me feel the pain of slop: I wanted to have a section on bottom-line thinking, and Claude immediately recommended the “dominant” approach called Conditional Value at Risk. I asked it to draft the section; it looked fine at first glance, and I even edited it to a point of satisfaction. But after I closed my laptop and went for a walk, I realized that I didn’t vibe with that model of decision-making on a fundamental level and deleted the section. Similarly, I asked it to brainstorm what the third post could be about, and it recommended “successor representation” just because it’s lesser known.

The irony is I combatted slop on Twitter for months and am still not immune to the temptation of outsourcing my thinking. To be fair, Claude has good takes sometimes! If I ask it to come up with interpretations of decision-making models, at least half of these I’d deem acceptable as takes that I’d come up with myself. But precision is a dangerous metric, for I must also track the recall of all the ideas that I would have generated that got crowded out.

AI is a performative idea generator. I think the ultimate harm of sycophancy is AI making you think you understand something when you really don’t. Why would I take an hour to deep dive into a paper when Claude can just summarize it and extract the “deeper insight”? Because I’m willing to pay an hour of my life for an insight that only I can generate. I have to decide that it’s worth it, even though my presentation of it will inevitably look uglier than Claude’s confident, clever slop quips.

Going forward, I will not be reduced to a mere verifier of ideas. I want to have 100% conviction in everything I put out, and that means being the first-mover of thoughts before outsourcing execution. This is harder because the world rewards us for speed and automation, but reward hacking that metric can never beat intrinsic understanding, which is incidentally the topic of this post.

Intrinsic Reward

And we will always raise them up / To the world we dream about, and the one we live in now — Hadestown

Some argue that to understand the world is to minimize surprise at new data. Cross-entropy on human data minimizes surprise at the world as it is. RL minimizes surprise at the world we dream about. Most RL setups assume a dense extrinsic reward. But some of the greatest things that people dreamt up didn’t come with a score. What drives behavior in the absence of external reward? Perhaps it is curiosity: reward the agent for its own surprise, and seek out experiences where its model of the world is most wrong.

Jürgen Schmidhuber's Formal Theory of Creativity, Fun, and Intrinsic Motivation is a popular starting point [1]. It says that an agent should be rewarded not for novelty per se, but for learning progress, or the rate at which its world model is improving. The naive version would reward novelty or visit counts of new states, but a truly random signal stays novel forever and teaches you nothing. We have to move beyond static state properties like novelty to dynamic measurements of learning progress. Ideally the reward behaves like cracking a hard problem: it keeps coming as long as you're making progress, and stops once you've plateaued or the problem is solved.

The algorithm

Learnability/interestingness is the first derivative of compressibility, where compressibility is how well you can compress your experience, or take actions that generate data you can compress. Assuming there is a tight coupling between compression and prediction, tracking how quickly our world models are improving tells us where to look next, and that makes us better decision-makers in general.
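
To make the "first derivative" idea concrete, here is a minimal sketch in Python. It is a windowed proxy, not Schmidhuber's exact formulation: it assumes prediction error stands in for compressibility, and it measures progress as the drop in mean error between two adjacent windows. The names `learning_progress` and `run`, the running-average predictor, and the window size are all hypothetical choices for illustration.

```python
import random


def learning_progress(errors, window):
    """Mean error over the older window minus mean error over the newer one.

    Positive when the predictor is improving, near zero when it has plateaued.
    Returns 0.0 until two full windows of history exist.
    """
    if len(errors) < 2 * window:
        return 0.0
    older = errors[-2 * window:-window]
    newer = errors[-window:]
    return sum(older) / window - sum(newer) / window


def run(stream, lr=0.1, window=20):
    """Track a scalar stream with a running-average predictor (an assumption).

    Returns two per-step lists: the squared prediction error (a novelty-style
    reward) and the learning-progress reward computed from it.
    """
    pred, errors, progress = 0.0, [], []
    for x in stream:
        errors.append((x - pred) ** 2)
        progress.append(learning_progress(errors, window))
        pred += lr * (x - pred)  # move the prediction toward the observation
    return errors, progress


rng = random.Random(0)
learnable = [1.0] * 200                                # fully predictable
noise = [rng.uniform(-1.0, 1.0) for _ in range(200)]   # never predictable

for name, stream in [("learnable", learnable), ("noise", noise)]:
    errors, progress = run(stream)
    print(name, "final error:", errors[-1], "final progress:", progress[-1])
```

On the constant stream, the error shrinks geometrically, so progress is positive early on and then fades to zero once the problem is "solved". On the noise stream, the error never goes away, so an error-based reward keeps paying out, while the progress signal only fluctuates around zero because there is nothing left to learn.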

Seeking surprise in dynamics

Pathak et al.'s intrinsic curiosity module formalizes curiosity as the error in the forward model [2]. The agent predicts what will happen if it takes an action, and derives the reward from the gap between prediction and reality. The architecture has two components:

(a) a network to embed observations into representations $\phi(x)$.
(b) a forward dynamics network to predict the representation of the next state conditioned on the current observation and action $p(\phi(x_{t+1}) \mid x_t, a_t)$.

Given a transition tuple $\{x_t, x_{t+1}, a_t\}$, the exploration reward is defined as surprisal:

$$ r_t = -\log p\left(\phi(x_{t+1}) \mid x_t, a_t\right) $$
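
As a rough sketch of how this reward could be computed, suppose the forward model outputs the mean of a Gaussian with fixed variance $\sigma^2$ over $\phi(x_{t+1})$. Under that assumption the surprisal is just the squared prediction error scaled by $1/(2\sigma^2)$, plus a constant. Everything named below is a hypothetical stand-in: `embed` is an identity placeholder for $\phi$, and `forward_model` is a hand-written linear map rather than a trained network.

```python
import math


def embed(x):
    """Placeholder for phi(x): the identity on a list of floats (assumption)."""
    return list(x)


def forward_model(phi_x, a, weights):
    """Hypothetical linear forward model.

    Returns the predicted mean of phi(x_{t+1}) from the embedding phi(x_t),
    a scalar action a_t, and a bias term.
    """
    inputs = phi_x + [a, 1.0]
    return [sum(w * v for w, v in zip(row, inputs)) for row in weights]


def surprisal_reward(x_t, a_t, x_next, weights, sigma=1.0):
    """r_t = -log p(phi(x_{t+1}) | x_t, a_t) under an isotropic Gaussian."""
    mean = forward_model(embed(x_t), a_t, weights)
    target = embed(x_next)
    sq_err = sum((t - m) ** 2 for t, m in zip(target, mean))
    log_norm = 0.5 * len(target) * math.log(2 * math.pi * sigma ** 2)
    return sq_err / (2 * sigma ** 2) + log_norm


# Toy dynamics the model already "knows": the action shifts the first coordinate.
weights = [
    [1.0, 0.0, 1.0, 0.0],  # next[0] = x[0] + a
    [0.0, 1.0, 0.0, 0.0],  # next[1] = x[1]
]

expected = surprisal_reward([0.0, 0.0], 1.0, [1.0, 0.0], weights)
surprising = surprisal_reward([0.0, 0.0], 1.0, [3.0, 0.0], weights)
print(expected, surprising)
```

The transition the model predicts perfectly still earns the constant term, so in practice only the squared-error part matters for ranking experiences. The second transition lands two units away from the prediction and earns exactly $2^2 / 2 = 2$ more reward.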