ZIXUAN WANG*, MiroMind RL Team, 2026
<aside> 📌
This is a personal technical writeup, based on work done during an internship on the MiroMind RL team and written for personal learning.
</aside>
A Small Detail With a Big Consequence: the same weights, two different policies.
<aside> 🧊
TL;DR In every modern RL framework, one engine samples your rollouts (vLLM / SGLang) and a different engine learns from them (FSDP / Megatron). They share the same weights $\theta$ but not the same arithmetic, so they are two different policies — and training silently goes off-policy.
This post tells one story across these three scales, then presents my two contributions: (1) Rollout Routing Replay (R³) — record which experts the inference engine fired per token and replay that exact routing in the training forward pass, killing the mismatch at its source. (2) Interaction Scaling — pushing the number of ReAct tool-interaction turns for large-scale agentic RL, where the source-level fix earns compound interest.
</aside>
Here is the entire problem in one inequality. Two engines share parameters $\theta$, but differ in kernels, precision, and — for MoE — discrete routing, so they realize two distinct distributions:
$$ \underbrace{\textcolor{red}{\pi_{\mathrm{infer}}}(\cdot\,;\theta)}_{\text{sampler rolls out (vLLM / SGLang)}} \;\neq\; \underbrace{\textcolor{blue}{\pi_{\mathrm{train}}}(\cdot\,;\theta)}_{\text{learner scores \& differentiates (FSDP / Megatron)}} $$
even though both are "the model with weights $\theta$." A numerically tiny per-token disagreement is, formally, an off-policy bug. The roadmap:
<aside> 🎨
Color convention (load-bearing). $\textcolor{red}{\text{red} = \text{inference / rollout / behavior (vLLM, SGLang)}}$ $\textcolor{blue}{\text{blue} = \text{training / learner / target (FSDP, Megatron)}}$ $\textcolor{green}{\text{green} = \text{the record→replay bridge (R³)}}$. The two importance ratios are kept under distinct letters everywhere:

Figure 1. Rollout Routing Replay (R³) at a glance: the $\textcolor{red}{\mathsf{Record}}$ (inference) → $\textcolor{green}{\mathsf{Replay}}$ (bridge) → $\textcolor{blue}{\mathsf{Train}}$ flow. The inference engine records its per-token top-$K$ expert mask during rollout, and that exact mask is replayed into the training forward pass, so the learner scores precisely the experts that fired. (Mechanism detailed in Part 3.)
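To make the record → replay flow in Figure 1 concrete, here is a minimal toy sketch for a single token with four experts and $K=2$. Everything in it is an illustrative assumption rather than the actual R³ implementation: the helper names `top_k_mask` and `gate_weights`, the hand-picked router logits, and the choice to renormalize gate weights with a softmax over the selected experts only.

```python
import math

def top_k_mask(router_logits, k):
    """Return a 0/1 mask marking the k largest router logits (ties go to the lower index)."""
    order = sorted(range(len(router_logits)), key=lambda i: (-router_logits[i], i))
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(router_logits))]

def gate_weights(router_logits, mask):
    """Softmax over the masked-in experts only; masked-out experts get weight 0.
    (Assumed gating rule for this toy example.)"""
    m = max(l for l, keep in zip(router_logits, mask) if keep)
    exps = [math.exp(l - m) if keep else 0.0 for l, keep in zip(router_logits, mask)]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical router logits for the same token and the same weights,
# as computed by the two engines. The drift is tiny but flips experts 1 and 2.
infer_logits = [1.000, 0.501, 0.499, -1.0]
train_logits = [1.000, 0.499, 0.501, -1.0]

# Record: the mask the inference engine actually used during rollout.
recorded_mask = top_k_mask(infer_logits, 2)

# Without replay, the training engine would re-derive its own (different) mask.
naive_train_mask = top_k_mask(train_logits, 2)

# Replay: the training engine keeps its own logits but is forced onto the recorded experts.
replayed_gates = gate_weights(train_logits, recorded_mask)

print("recorded mask   :", recorded_mask)
print("naive train mask:", naive_train_mask)
print("replayed gates  :", [round(g, 4) for g in replayed_gates])
```

In this toy case a logit difference of 0.002 is enough to route the token through a different expert during training than during rollout. Replaying the recorded mask removes that discrete disagreement, while the gate values themselves are still computed from the training engine's logits.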
Vanilla policy gradient for a response $a$ with reward $R(a)$:
$$ \theta \leftarrow \theta + \mu\, \underbrace{\mathbb{E}_{a\sim \textcolor{blue}{\pi_{\mathrm{train}}}(\theta)}\!\Big[\,R(a)\,\nabla_\theta \log \textcolor{blue}{\pi_{\mathrm{train}}}(a;\theta)\,\Big]}_{\text{on-policy: sample and score with the SAME distribution}} $$
This estimator is unbiased only because the sampling distribution and the scored distribution coincide.
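The unbiasedness claim can be checked by exact enumeration on a toy problem. The sketch below assumes a three-action softmax policy with made-up logits and rewards, and models the inference engine as the same logits plus a small hand-picked perturbation. That perturbation is a hypothetical stand-in for kernel and precision differences, not a measurement from any real engine.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_score_gradient(sample_probs, policy_probs, rewards):
    """E_{a ~ sample_probs}[ R(a) * d log policy(a) / d logits ], by exact enumeration."""
    n = len(policy_probs)
    grad = [0.0] * n
    for a in range(n):
        for j in range(n):
            # d log softmax(a) / d logit_j = 1[a == j] - policy_probs[j]
            dlogp = (1.0 if a == j else 0.0) - policy_probs[j]
            grad[j] += sample_probs[a] * rewards[a] * dlogp
    return grad

logits = [0.2, -0.1, 0.5]    # toy parameters theta (assumed values)
rewards = [1.0, 0.0, 2.0]    # toy rewards R(a) (assumed values)

pi_train = softmax(logits)
# Hypothetical inference-side policy: same logits, slightly different arithmetic.
pi_infer = softmax([x + d for x, d in zip(logits, [1e-2, -2e-2, 5e-3])])

# Ground truth: gradient of J = sum_a pi_train(a) R(a) with respect to the logits.
baseline = sum(p * r for p, r in zip(pi_train, rewards))
true_grad = [pi_train[j] * (rewards[j] - baseline) for j in range(3)]

on_policy = expected_score_gradient(pi_train, pi_train, rewards)
mismatched = expected_score_gradient(pi_infer, pi_train, rewards)
bias = [m - t for m, t in zip(mismatched, true_grad)]

print("true gradient :", true_grad)
print("on-policy     :", on_policy)
print("mismatched    :", mismatched)
print("bias          :", bias)
```

When sampling and scoring use the same distribution, the expectation reproduces the true gradient up to floating-point error. When actions are drawn from the perturbed distribution but scored under `pi_train`, the expectation shifts by a nonzero bias, even though the two distributions differ only in the third decimal place.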