anruigu.github.io, 06/02/2026

Part 1

Part 2

This is not framed in the manner of Algorithms to Live By, but I’ll be running some IL+RL algo in life in the foreseeable future. When I took Econ 101A, I became a little obsessed with game theory. A lot of these works originated from Berkeley. I think best in natural language space, and the math space expressions didn’t quite click when I was there but are starting to now.

My first blog post, “Scaling Taste,” raised a series of questions about scaling expert taste, stemming from applied LLM post-training. Turns out robotics people had been thinking about this for a long time. I spent a day going through a fantastic lecture series on imitation learning by Prof. Sanjiban Choudhary, which forms the basis of this post.

The core thesis is that all of imitation learning is a game. The adversary is trying to learn a value function that makes the learner look much worse than the expert, penalizing harshly anything outside of expert support. A generator produces a policy and a discriminator distinguishes it from a human expert.
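One common way to write this game down, as a sketch in my own notation rather than anything quoted from the lectures: assume a class $\mathcal{F}$ of candidate reward functions $f(s, a)$ and a horizon $T$. The adversary picks the $f$ that makes the expert $\pi^*$ look as much better than the learner $\pi$ as possible, and the learner picks $\pi$ to shrink that gap.

$$ \min_{\pi} \max_{f \in \mathcal{F}} \; \mathbb{E}_{\tau \sim \pi^*}\left[ \sum_{t=1}^{T} f(s_t, a_t) \right] - \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=1}^{T} f(s_t, a_t) \right] $$

If the true reward lies in $\mathcal{F}$, driving this gap to zero means the learner matches the expert's return.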


The Three Moments

It’s quite interesting how imitation, something intuitively collaborative, can be cast as adversarial. I think the fancy kinds of distillation showing up recently are riffing on the Performance Difference Lemma: beyond whether the student mimics the expert on expert states, we care more about how good the expert's policy would be from the states the student actually reaches.

The 3 moments in Swamy et al. (2021) are three sources of imitation error:

| Moment matched | What you measure error on | Bound |
| --- | --- | --- |
| Reward (e.g. GAIL) | Student rollouts vs. expert rollouts | $O(\epsilon T)$ |
| Off-policy Q (e.g. BC/SFT) | Expert states, student actions | $O(\epsilon T^2)$ |
| On-policy Q (e.g. DAgger) | Student states, expert corrections | $O(\epsilon H T)$ |

$H$ is recoverability.
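To get a feel for why the bound can be quadratic in $T$ without recoverability and linear with it, here is a toy calculation. It is my own illustrative model, not taken from Swamy et al.: the student errs with probability `eps` at each step. In the unrecoverable case it pays cost 1 on every step from the first mistake onward; in the recoverable case each mistake costs at most `H` steps before the student is back on track. The function names are hypothetical.

```python
def unrecoverable_cost(eps: float, T: int) -> float:
    """Expected cost when the first mistake is never corrected.

    P(at least one mistake by step t) = 1 - (1 - eps)**t, and the student
    pays 1 on every step at or after its first mistake.
    """
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, T + 1))


def recoverable_cost_bound(eps: float, T: int, H: int) -> float:
    """Upper bound when each mistake costs at most H steps: about eps*T
    mistakes in expectation, each costing at most H."""
    return eps * T * H


if __name__ == "__main__":
    eps, H = 0.001, 5
    for T in (50, 100, 200):
        print(T, unrecoverable_cost(eps, T), recoverable_cost_bound(eps, T, H))
```

Under these assumptions, doubling `T` roughly quadruples the unrecoverable cost while the recoverable bound only doubles.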

The authors say matching the reward moment is potentially an exponentially harder optimization problem, and intuitively that makes sense: it amounts to a density estimation problem over trajectories, which is much harder than a per-step classification problem. The two remaining moments correspond to two rollout orders:

Expert first (behavioral cloning):

$$ J(\pi) - J(\pi^*) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi^*}}\left[ V^{\pi}(s) - Q^{\pi}(s, \pi^*(s)) \right] $$

Sample states from the expert's trajectory. Then ask: how much does the student lose by not taking the expert action here? The expectation is over $d^{\pi^*}$, states the expert visits.

Student first (DAgger fix):

$$ J(\pi) - J(\pi^*) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi}}\left[ Q^{\pi^*}(s, \pi(s)) - V^{\pi^*}(s) \right] $$

Sample states $s$ from the student's trajectory. Then ask: what does the expert value function say about the student action here? The expectation is over $d^{\pi}$, states the student actually visits. This matters because not all deviations are equal. With the distillation wave, we’re just rediscovering that every algorithm that beats SFT is finding some way to evaluate the teacher on the student's trajectory distribution, not the teacher's.
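Both rollout orders are the same Performance Difference Lemma read in two directions, and that is easy to check numerically. Below is a minimal sketch on a small random tabular MDP with deterministic policies. The MDP, the helper names (`random_mdp`, `evaluate`), and the choice of random "expert" and "student" policies are all my own assumptions for illustration; the identity holds for any pair of policies, so the expert here is not optimal.

```python
import numpy as np


def random_mdp(n_states=5, n_actions=3, seed=0):
    """Hypothetical toy MDP: random transitions P[s, a, s'], rewards R[s, a]."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)  # normalize each row into a distribution
    R = rng.random((n_states, n_actions))
    mu0 = np.full(n_states, 1.0 / n_states)  # uniform start-state distribution
    return P, R, mu0


def evaluate(P, R, mu0, pi, gamma):
    """Exact V, Q, discounted state occupancy d, and return J for a
    deterministic policy pi (integer array of shape (n_states,))."""
    n = P.shape[0]
    idx = np.arange(n)
    P_pi = P[idx, pi]  # (n, n) transition matrix under pi
    r_pi = R[idx, pi]  # (n,) reward under pi
    A = np.eye(n) - gamma * P_pi
    V = np.linalg.solve(A, r_pi)  # V = (I - gamma P_pi)^{-1} r_pi
    Q = R + gamma * (P @ V)  # (n, n_actions)
    d = (1.0 - gamma) * np.linalg.solve(A.T, mu0)  # normalized occupancy
    J = mu0 @ V
    return V, Q, d, J


gamma = 0.9
P, R, mu0 = random_mdp()
rng = np.random.default_rng(1)
n_states, n_actions = R.shape
expert = rng.integers(n_actions, size=n_states)   # stands in for pi^*
student = rng.integers(n_actions, size=n_states)  # stands in for pi
s_idx = np.arange(n_states)

V_e, Q_e, d_e, J_e = evaluate(P, R, mu0, expert, gamma)
V_s, Q_s, d_s, J_s = evaluate(P, R, mu0, student, gamma)

gap = J_s - J_e
# Expert first: expert states, student's value functions.
expert_first = d_e @ (V_s - Q_s[s_idx, expert]) / (1.0 - gamma)
# Student first: student states, expert's value functions.
student_first = d_s @ (Q_e[s_idx, student] - V_e) / (1.0 - gamma)

print(gap, expert_first, student_first)
```

All three printed numbers should agree up to floating-point error. What changes between the two forms is only whose state distribution you need samples from and whose value function you need to query.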

Interactions → RL