answering rl interview questions from this insane list: https://x.com/sheriyuo/status/2063295181131247674
and this is gonna be comprehensive through my thinking process, just bear with my spelling mistakes 😅, and these are only the first version as im yapping out from the top of my head
1. Why use Actor-Critic instead of a pure Critic approach?
the simple picture of actor-critic: the actor is the policy, it looks at the current state of the environment and rolls out an action, and the critic is the one that evaluates the action the actor just took. the actor corrects itself from the critic’s feedback (the value function) by computing the advantage, a(s_t, a_t) = q(s_t, a_t) - v(s_t), with q(s_t, a_t) ≈ r_t + gamma * v(s_{t+1}), gamma being the discount factor. the critic corrects itself from the rewards along the trajectory: it outputs v(s_t), and there’s a bootstrapped target that plays the role of a ground truth, y_t = r_t + gamma * v(s_{t+1}), so after that it’s just mse regression, (v(s_t) - y_t)^2.
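to make the advantage and the critic target concrete, here’s a minimal sketch in plain python. the function names (`td_target`, `advantage`, `critic_loss`) and the toy numbers are made up for illustration, and the discount factor is called `gamma` in the code. a real setup would compute these over batches with an autodiff library.

```python
def td_target(reward, next_value, gamma=0.99, done=False):
    """bootstrapped target y_t = r_t + gamma * v(s_{t+1}); no bootstrap at episode end."""
    return reward + (0.0 if done else gamma * next_value)


def advantage(reward, value, next_value, gamma=0.99, done=False):
    """one-step advantage estimate: y_t - v(s_t)."""
    return td_target(reward, next_value, gamma, done) - value


def critic_loss(value, target):
    """squared error between the critic's prediction and the (fixed) target."""
    return (value - target) ** 2


# toy numbers, made up for illustration
r_t, v_t, v_next, gamma = 1.0, 0.5, 2.0, 0.9
y_t = td_target(r_t, v_next, gamma)
a_t = advantage(r_t, v_t, v_next, gamma)
print(y_t, a_t, critic_loss(v_t, y_t))
```

a positive advantage means the action did better than the critic expected, so the actor raises its probability (typically via a loss like -log_prob * advantage), and a negative one pushes it down.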
so the benefit of actor-critic is the coupling: the actor gets lower-variance, more reliable feedback than raw returns, and that feedback comes from a critic that is itself getting trained alongside it. the same idea shows up in gans (generator/discriminator), and it’s described well in the deepseek math v2 paper too.
critic-only methods, like q-learning, learn a value function q(s,a) and then pick the action with the highest value, a* = argmax_a q(s,a). this works fine when the action space is small, like left/right/jump. but when the action space gets really big or continuous, it becomes a pain because there are way too many possible actions to check, maybe even infinitely many.
for example, imagine a lunar lander trying to land safely. it has to set continuous controls like main and side engine thrust, which in turn drive its angle, rotation, velocity, and fuel usage. a critic-only method would somehow have to search through tons of possible action combinations at every step to figure out which one is best, which isn’t really practical.
actor-critic gets around this by adding an actor. instead of searching through all those actions, the actor just looks at the current state and says "do this". then the critic looks at that action and decides if it was better or worse than expected. that feedback gets sent back to the actor so it can slowly learn to make better choices in similar situations later. so basically, the actor narrows down the huge action space by proposing an action directly, and the critic judges that action and helps the actor improve over time.
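a small sketch of the search problem described above, in plain python. `toy_q` and `toy_actor` are hypothetical stand-ins (a hand-written critic whose best action is -0.5 * state, and an actor assumed to be already trained), not anything from a real library. the point is only to compare how many critic evaluations each approach needs.

```python
def greedy_discrete(q_values):
    """critic-only, small action space: argmax over a finite list of q-values."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def greedy_grid(q_fn, state, low, high, n_points):
    """critic-only, 1-d continuous action: brute-force the critic over a grid."""
    step = (high - low) / (n_points - 1)
    grid = [low + i * step for i in range(n_points)]
    return max(grid, key=lambda a: q_fn(state, a))


def toy_q(state, action):
    """hypothetical critic whose best action is -0.5 * state."""
    return -(action + 0.5 * state) ** 2


def toy_actor(state):
    """hypothetical (already trained) actor: one forward pass, no search."""
    return max(-1.0, min(1.0, -0.5 * state))


print(greedy_discrete([0.1, 0.7, 0.2]))        # index of the largest q-value
print(greedy_grid(toy_q, 0.4, -1.0, 1.0, 11))  # 11 critic calls for one action dimension
print(toy_actor(0.4))                          # zero critic calls
print(11 ** 4)                                 # grid size for a 4-d action at the same resolution
```

the grid grows as n_points ** action_dim, while the actor’s cost per step stays at one forward pass no matter how many action dimensions there are.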
2. What is the relationship between KL divergence, cross entropy, and MLE?
kl divergence is basically measuring the mismatch between two distributions, like how poorly one prob distribution covers the probability mass of the other. cross-entropy is the average prediction cost/surprise when using a predicted distribution to model the true distribution. mle, maximum likelihood estimation, is choosing the model params that make the observed data as likely as possible.
instead of writing out each formula separately, we can derive the relationship from kl divergence itself, taking the forward kl here:
dkl(p||q) = E_{x~p}[log(p(x)/q(x))]
dkl(p||q) = E_{x~p}[log p(x)] - E_{x~p}[log q(x)]
dkl(p||q) = -H(p) + H(p, q), where H(p) is the entropy (average surprise, in bits if the log is base 2) and H(p, q) is the cross-entropy
H(p, q) = dkl(p||q) + H(p)
from this we can conclude: mle maximizes the probability of the observed data. written as a loss, that becomes minimizing the negative log-likelihood, which is the same as minimizing the empirical cross-entropy H(p_data, q_model). and since H(p_data) doesn’t depend on the model params, that is also minimizing dkl(p_data || q_model), the kl from the data distribution to the model distribution.
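a quick numeric sanity check of the identities above, as a sketch in plain python. the distributions `p` and `q` and the `samples` list are made up, and natural log is used, so the units are nats rather than bits.

```python
import math


def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    """forward kl: dkl(p||q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.25]   # "true" distribution (made up)
q = [0.4, 0.4, 0.2]     # model distribution (made up)

# H(p, q) = dkl(p||q) + H(p)
assert abs(cross_entropy(p, q) - (kl(p, q) + entropy(p))) < 1e-12

# mle view: the average negative log-likelihood of samples under q
# equals the cross-entropy between the empirical distribution and q
samples = [0, 0, 1, 2]
p_emp = [samples.count(k) / len(samples) for k in range(len(q))]
nll = -sum(math.log(q[x]) for x in samples) / len(samples)
assert abs(nll - cross_entropy(p_emp, q)) < 1e-12
```

since `entropy(p_emp)` is a constant with respect to `q`, whichever `q` minimizes `nll` also minimizes `kl(p_emp, q)`.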
3. How should rewards be designed in different RL scenarios?
this depends heavily on the env we choose, and what strikes my mind first is sparse vs dense rewards. a sparse reward is one you only get at the end of the trajectory, with the shortcoming that it gives no richer feedback about which step was right or wrong. a dense reward, on the other hand, spreads the feedback over the whole trajectory, scoring each step taken. for example, in a math task, the sparse reward is final_boxed_answer == ground_truth and the dense reward is something that evaluates the reasoning between the <think>...</think> tokens.
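here’s a rough sketch of the sparse vs dense split for the math example, in plain python. everything in it is a toy assumption: `toy_step_scorer` only understands steps written like "a+b=c" and stands in for whatever process reward model or checker would really score the reasoning.

```python
def sparse_reward(final_answer, ground_truth):
    """one reward at the end of the trajectory: exact match on the final answer."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0


def toy_step_scorer(step):
    """hypothetical process reward: checks a single 'a+b=c' step."""
    lhs, rhs = step.split("=")
    a, b = lhs.split("+")
    return 1.0 if int(a) + int(b) == int(rhs) else 0.0


def dense_rewards(steps, final_answer, ground_truth, step_scorer=toy_step_scorer):
    """one reward per reasoning step, plus the outcome reward as the last entry."""
    rewards = [step_scorer(step) for step in steps]
    rewards.append(sparse_reward(final_answer, ground_truth))
    return rewards


steps = ["2+3=5", "5+4=10"]   # second step is wrong on purpose
print(sparse_reward("10", "9"))
print(dense_rewards(steps, "10", "9"))
```

the sparse signal only says the trajectory failed, while the dense one also points at the second step as the place where it went wrong.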
next comes the distinction between verifiable and unverifiable environments. for verifiable envs like games/math/coding there are ground truths, which are useful for computing the rewards, and carefully laid out process rewards can push the model to make careful decisions rather than overfit. for unverifiable envs, where we care about things like quality or creativity (for example, judging the quality of a piece of code), we can go with llm-as-judge and a rubric system for assigning the rewards.
in the end it boils down to the env: how rich we want the feedback to be, while avoiding problems like reward hacking and overfitting to the verifiers.
4. How do importance sampling, rejection sampling, and other Monte Carlo methods fit into RL?
coming from the basics, rl’s objective is basically to maximize the expected reward coming from the environment, and in turn to optimize the model with this objective: