anruigu.github.io, 05/31/2026

I’m doing inverse reinforcement learning on myself: IRL^2, inverse RL in real life. This is inspired by Algorithms to Live By by Brian Christian and Tom Griffiths, and for three years now I’ve wanted to write a sequel called “ML Algorithms to Live By”. Out of scope: scientific studies that actually map RL components to areas of the brain. I just want to use my understanding of RL to make better decisions.

When we say someone is a good decision-maker with foresight, which RL components does that map to in real life? Here’s an attempt:

Value function (VF): When you pick one option over another, you're not just expressing a preference for what's in front of you; you're expressing a belief about where each path leads. The job offer you turned down wasn't just "less appealing" in the moment. Your judgment also folded in an implicit estimate of its future trajectory (more in section 1 below). But VFs are noisy: mood, fatigue, and the last conversation you had all act like the Gumbel noise in a choice model, perturbing your choices away from your "true" Q-values. And it might not even be a single function: the DPL categorical model (3rd paper below) raises the uncomfortable possibility that you have multiple value functions running in parallel, each activated by hidden context, so that what looks like inconsistency is actually different selves with different reward functions taking turns at the wheel.

Beliefs / world model (implicitly encoded in value function): how you think actions lead to outcomes. Two people can have identical reward functions and make totally different choices just because they hold different beliefs about how the world works. Someone who doesn't negotiate salary doesn't necessarily undervalue money; they might just believe negotiating never works.
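The salary example can be made concrete with a toy calculation. This is only a sketch: the reward numbers and the function names `q_negotiate` and `choose` are made up for illustration. Both people share the same reward table and differ only in the probability they assign to a negotiation succeeding.

```python
# Hypothetical reward function shared by both people (made-up numbers).
REWARD = {"raise": 10.0, "awkward_no": -1.0, "stay_quiet": 0.0}


def q_negotiate(believed_p_success: float) -> float:
    """Expected reward of negotiating under a person's own belief."""
    return (believed_p_success * REWARD["raise"]
            + (1.0 - believed_p_success) * REWARD["awkward_no"])


def choose(believed_p_success: float) -> str:
    """Pick whichever action has the higher believed value."""
    if q_negotiate(believed_p_success) > REWARD["stay_quiet"]:
        return "negotiate"
    return "stay quiet"


print(choose(0.50))  # someone who believes negotiating often works
print(choose(0.05))  # someone who believes negotiating almost never works
```

Same rewards, different beliefs, different behavior. Watching only the choice, you could easily conclude the second person "doesn't care about money."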

Reward function: what actually gives you satisfaction in the moment, independent of consequences. This is harder to identify than it sounds because humans confabulate a lot. You tell yourself you love your job because of "impact", but the actual reward signal might be more like "someone respected my opinion today."

Discount factor γ: how much you weight future rewards relative to present ones. Crucially, this isn't fixed. When you're stressed or depleted you become more myopic; when you feel secure you can afford to be patient.
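A small worked example of how γ alone can flip a decision. The rewards, the delay, and the two γ values are assumptions chosen for illustration, and `discounted_value` and `prefers_waiting` are hypothetical helper names.

```python
def discounted_value(reward: float, delay: int, gamma: float) -> float:
    """Present value of a reward received `delay` steps from now."""
    return (gamma ** delay) * reward


def prefers_waiting(gamma: float, now: float = 5.0,
                    later: float = 10.0, delay: int = 3) -> bool:
    """True if the delayed reward is worth more than the immediate one."""
    return discounted_value(later, delay, gamma) > now


print(prefers_waiting(0.95))  # a secure, patient day
print(prefers_waiting(0.60))  # a stressed, depleted day
```

With γ = 0.95 the delayed reward is worth about 8.6 today, so waiting wins. With γ = 0.6 it is worth about 2.2, so the same person with the same rewards grabs the smaller payoff now.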

The policy: what you actually do, which is a noisy softmax over the values produced by all of the above. Your behavior is the compiled output of these interacting components, which is why behavior alone is so hard to interpret without the rest.
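To see what "noisy softmax" means in practice, here is a minimal sketch using only the standard library. It relies on a standard fact (the Gumbel-max trick): adding independent Gumbel noise to each Q-value and taking the argmax gives the same choice probabilities as sampling from a softmax. The Q-values, the temperature, and all function names are assumptions for illustration.

```python
import math
import random


def softmax(q_values, temperature=1.0):
    """Choice probabilities implied by Q-values at a given temperature."""
    scaled = [q / temperature for q in q_values]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    """Draw one standard Gumbel sample via inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def noisy_choice(q_values, rng, temperature=1.0):
    """Perturb each scaled Q-value with Gumbel noise, then take the argmax."""
    noisy = [q / temperature + sample_gumbel(rng) for q in q_values]
    return max(range(len(noisy)), key=noisy.__getitem__)


def empirical_frequencies(q_values, n_samples=20000, seed=0):
    """How often each option is picked across many noisy choices."""
    rng = random.Random(seed)
    counts = [0] * len(q_values)
    for _ in range(n_samples):
        counts[noisy_choice(q_values, rng)] += 1
    return [c / n_samples for c in counts]


q = [1.0, 0.0]  # option 0 is "truly" better by one unit
print(softmax(q))
print(empirical_frequencies(q))
```

The two printed lists should roughly agree: even with a clear "true" preference, the worse option still gets picked a meaningful fraction of the time, which is why a single observed choice says little about the underlying values.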

We make choices based on goals and beliefs

Choice between partial trajectories: Disentangling goals from beliefs

The paper extends subjective expected utility maximization (Savage, 1972) to RL. In Savage's framework, goals are represented by a utility function rather than a reward function, and beliefs by a subjective probability distribution over states rather than over infinite future trajectories. Savage treated these as the two irreducible primitives of rational choice decades before RL independently rediscovered the same decomposition.

The paper considers two competing models of what a human is expressing when they choose. The partial return model: you just sum the rewards you've seen so far in the trajectory, with no lookahead. That rarely matches real life, where almost every choice is made with an eye on what comes next. The cumulative advantage model: a deviation score capturing how much better you think this trajectory is compared to what you think optimal behavior looks like. The cumulative advantage model is almost a model of anxiety: you're not evaluating options on their own terms, you're constantly comparing yourself to some internal benchmark of "how things should be going". And we’re often terrible at benchmarking ourselves! What would be nice is a model that can recover what you actually want even when your benchmark is wrong.

Enter the bootstrapped return model, a reunion of sorts between rational choice theory and RL. When you choose something, you're not just expressing what you value; you're also expressing what you expect to happen next. Someone who picks the "safe" job isn't just valuing safety; they're predicting the risky one will fail.
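Here is how I would sketch the three choice models side by side on the job example. The assumptions are loud ones: rewards are undiscounted, transitions are deterministic, all numbers are made up, and the cumulative advantage is computed against the chooser's own believed values as a telescoping sum. The paper's exact definitions may differ in the details, and the function names are mine.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, float, str]  # (state, reward, next_state)


def partial_return(segment: List[Step]) -> float:
    """Sum of the rewards seen inside the segment, with no lookahead."""
    return sum(reward for _, reward, _ in segment)


def bootstrapped_return(segment: List[Step], believed_value: Dict[str, float]) -> float:
    """Partial return plus the believed value of where the segment ends."""
    end_state = segment[-1][2]
    return partial_return(segment) + believed_value[end_state]


def cumulative_advantage(segment: List[Step], believed_value: Dict[str, float]) -> float:
    """Sum of per-step advantages r + V(s') - V(s) against believed values."""
    return sum(r + believed_value[s_next] - believed_value[s]
               for s, r, s_next in segment)


# Two years of each job (hypothetical rewards per year).
safe = [("offer", 2.0, "safe_y1"), ("safe_y1", 2.0, "safe_y2")]
risky = [("offer", 1.0, "risky_y1"), ("risky_y1", 1.0, "risky_y2")]

# Same rewards, different beliefs about where the risky path ends up.
pessimist = {"offer": 14.0, "safe_y1": 12.0, "safe_y2": 10.0,
             "risky_y1": 1.0, "risky_y2": 0.0}
optimist = {"offer": 32.0, "safe_y1": 12.0, "safe_y2": 10.0,
            "risky_y1": 31.0, "risky_y2": 30.0}

for name, beliefs in [("pessimist", pessimist), ("optimist", optimist)]:
    print(name,
          "safe:", bootstrapped_return(safe, beliefs),
          "risky:", bootstrapped_return(risky, beliefs))
```

Partial return ranks the safe job higher for everyone (4 versus 2), so it cannot explain why the optimist picks the risky job. Bootstrapped return can: the two people agree on the rewards they saw and disagree only on the value of where each path ends.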

“A reward function that aligns with human preferences can be recovered from choice data even if the human makes choices based on erroneous beliefs about the environment. In particular, we recover the same reward function from two humans who share goals but make different choices due to differing beliefs.”

A way to live by this algorithm: apply it to yourself! When you make a decision, proactively question the assumptions you're making about the dynamics of the environment, not just what has or hasn't worked so far. When a decision doesn't go the way you wanted, refrain from immediately updating your preferences. Ask first whether your beliefs were wrong. Did you pick the safe job because you value safety, or because you predicted the risky path would collapse? And was that prediction accurate? When taking others’ advice, think about their underlying beliefs too.

We make choices from bounded forward-looking plans