ZIXUAN WANG*, MiroMind RL Team, 2026

<aside> 📌

This is a personal technical writeup of work done during an internship on the MiroMind RL team, written for personal learning only.

</aside>

A Small Detail With a Big Consequence: the same weights, two different policies.

<aside> 🧊

TL;DR In every modern RL framework, one engine samples your rollouts (vLLM / SGLang) and a different engine learns from them (FSDP / Megatron). They share the same weights $\theta$ but not the same arithmetic, so they are two different policies — and training silently goes off-policy.

This post tells one story across these three scales, then presents my two contributions: (1) Rollout Routing Replay (R³): record which experts the inference engine fired for each token and replay that exact routing in the training forward pass, removing the mismatch at its source. (2) Interaction Scaling: scaling up the number of ReAct tool-interaction turns in large-scale agentic RL, where the source-level fix earns compound interest.

</aside>

Why this matters

Here is the entire problem in one inequality. Two engines share parameters $\theta$, but differ in kernels, precision, and — for MoE — discrete routing, so they realize two distinct distributions:

$$ \underbrace{\textcolor{red}{\pi_{\mathrm{infer}}}(\cdot\,;\theta)}_{\text{sampler rolls out (vLLM / SGLang)}} \;\neq\; \underbrace{\textcolor{blue}{\pi_{\mathrm{train}}}(\cdot\,;\theta)}_{\text{learner scores \& differentiates (FSDP / Megatron)}} $$

even though both are "the model with weights $\theta$." A numerically tiny per-token disagreement is, formally, an off-policy bug. The roadmap:

  1. Rollout Routing Replay (R³) training for MoE models — fix the mismatch at its source.
  2. Interaction Scaling in ReAct agents — why long horizons make the source-level fix indispensable.
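To make the inequality above concrete, here is a minimal, stdlib-only sketch of how identical weights can yield two distributions. Everything in it is an illustrative assumption rather than measured behavior of any engine: the logit values are made up, `to_fp16` stands in for "a lower-precision kernel", and `top_k` stands in for an MoE router. It shows a per-token importance ratio that is close to, but not exactly, 1, and a near-tied router whose top-1 expert flips under rounding.

```python
import math
import struct

def to_fp16(x):
    """Round a float to IEEE half precision and back (stand-in for a low-precision kernel)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def top_k(scores, k):
    """Indices of the k largest scores; ties go to the lower index (sorted is stable)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(order[:k])

# Hypothetical values computed from the same weights theta.
token_logits = [3.1415926, 1.6180339, 0.5772156]
router_logits = [2.0001, 2.0002, 0.5]  # two experts nearly tied

# "Learner" arithmetic: full precision. "Sampler" arithmetic: rounded logits.
train_logp = log_softmax(token_logits)
infer_logp = log_softmax([to_fp16(v) for v in token_logits])

# Importance ratio pi_train / pi_infer for token 0; exactly 1 only if the engines agree.
ratio = math.exp(train_logp[0] - infer_logp[0])

# Discrete routing: a tiny numeric difference selects a different expert.
train_experts = top_k(router_logits, k=1)
infer_experts = top_k([to_fp16(v) for v in router_logits], k=1)

print("ratio:", ratio)
print("train experts:", train_experts, "infer experts:", infer_experts)
```

The continuous part of the gap is small per token, while the routing flip is a discrete change in which parameters are evaluated at all; that second effect is the one the MoE-specific fix targets.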

<aside> 🎨

Color convention (load-bearing). $\textcolor{red}{\text{red} = \text{inference / rollout / behavior (vLLM, SGLang)}}$ $\textcolor{blue}{\text{blue} = \text{training / learner / target (FSDP, Megatron)}}$ $\textcolor{green}{\text{green} = \text{the record→replay bridge (R³)}}$. The two importance ratios are kept under distinct letters everywhere:

image.png

Figure 1. Rollout Routing Replay (R³) at a glance: the $\textcolor{red}{\mathsf{Record}}$ (inference) → $\textcolor{green}{\mathsf{Replay}}$ (bridge) → $\textcolor{blue}{\mathsf{Train}}$ flow. The inference engine records its per-token top-$K$ expert mask during rollout, and that exact mask is replayed into the training forward pass, so the learner scores precisely the experts that fired. (Mechanism detailed in Part 3.)
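The record → replay flow in the caption can be sketched with a toy MoE layer. This is a minimal illustration under stated assumptions, not the implementation used in this work: `top_k_mask`, `moe_layer`, the scalar "experts", and the router values are all hypothetical, and computing the gate weights from the training engine's own router logits over the replayed experts is an assumption of this sketch.

```python
import math

def top_k_mask(router_logits, k):
    """Boolean mask marking the k experts with the largest router logits."""
    order = sorted(range(len(router_logits)), key=lambda i: -router_logits[i])
    chosen = set(order[:k])
    return [i in chosen for i in range(len(router_logits))]

def moe_layer(x, router_logits, experts, k, replay_mask=None):
    """Toy MoE forward pass for one token.

    If replay_mask is given, it overrides this engine's own top-k choice.
    Gate weights are a softmax of this engine's logits over the active experts
    (an assumption of this sketch).
    """
    mask = top_k_mask(router_logits, k) if replay_mask is None else replay_mask
    active = [i for i, on in enumerate(mask) if on]
    m = max(router_logits[i] for i in active)
    w = {i: math.exp(router_logits[i] - m) for i in active}
    z = sum(w.values())
    y = sum(w[i] / z * experts[i](x) for i in active)
    return y, mask

# Four hypothetical scalar experts.
experts = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: -1.0 * x, lambda x: 0.5 * x]

# Same weights, slightly different arithmetic: experts 1 and 2 swap rank.
infer_router = [1.30, 1.01, 1.00, 0.10]
train_router = [1.30, 1.00, 1.01, 0.10]

# Record: the inference engine stores its per-token mask during rollout.
_, recorded = moe_layer(1.0, infer_router, experts, k=2)

# Without replay, the training engine routes differently.
_, free_mask = moe_layer(1.0, train_router, experts, k=2)

# Replay: the training forward pass reuses the recorded mask.
_, replayed = moe_layer(1.0, train_router, experts, k=2, replay_mask=recorded)

print("recorded:", recorded)
print("free:    ", free_mask)
print("replayed:", replayed)
```

With replay enabled, the training pass evaluates the same experts that produced the sampled token, even though its own router would have picked a different set.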

1. The Mismatch Problem

1.1 The ideal: on-policy REINFORCE

Vanilla policy gradient for a response $a$ with reward $R(a)$:

$$ \theta \leftarrow \theta + \mu\, \underbrace{\mathbb{E}_{a\sim \textcolor{blue}{\pi_{\mathrm{train}}}(\theta)}\!\Big[\,R(a)\,\nabla_\theta \log \textcolor{blue}{\pi_{\mathrm{train}}}(a;\theta)\,\Big]}_{\text{on-policy: sample and score with the SAME distribution}} $$

This estimator is unbiased only because the sampling distribution and the scored distribution coincide.
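A small exact calculation illustrates the unbiasedness claim. The sketch below assumes a three-armed bandit with a softmax policy over logits $\theta$; the rewards, the logits, and the perturbation that defines the hypothetical `pi_infer` are made up for illustration. It computes the expectation of $R(a)\,\nabla_\theta \log \pi_{\mathrm{train}}(a;\theta)$ in closed form (no sampling noise) and compares it with the true gradient of the expected reward.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [v / z for v in e]

def expected_reinforce(sample_probs, score_probs, rewards):
    """Exact E_{a ~ sample}[ R(a) * d/d theta log score(a) ] for a softmax policy.

    For a softmax over logits, d/d theta_j log pi(a) = 1[a == j] - pi_j.
    """
    n = len(rewards)
    grad = [0.0] * n
    for a in range(n):
        for j in range(n):
            dlogp = (1.0 if a == j else 0.0) - score_probs[j]
            grad[j] += sample_probs[a] * rewards[a] * dlogp
    return grad

def true_gradient(probs, rewards):
    """Gradient of J(theta) = sum_a pi(a) R(a): dJ/d theta_j = pi_j * (R_j - J)."""
    j_val = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j_val) for p, r in zip(probs, rewards)]

theta = [0.2, -0.1, 0.4]
rewards = [1.0, 0.0, 2.0]

pi_train = softmax(theta)
# Hypothetical sampler: same theta, slightly different arithmetic.
pi_infer = softmax([t + d for t, d in zip(theta, [0.05, -0.05, 0.0])])

exact = true_gradient(pi_train, rewards)
on_policy = expected_reinforce(pi_train, pi_train, rewards)   # sample and score agree
off_policy = expected_reinforce(pi_infer, pi_train, rewards)  # sample with the other engine

on_err = max(abs(a - b) for a, b in zip(on_policy, exact))
off_err = max(abs(a - b) for a, b in zip(off_policy, exact))
print("on-policy error: ", on_err)
print("off-policy error:", off_err)
```

When the sampling and scoring distributions coincide, the expectation matches the true gradient up to floating-point rounding; when samples come from the perturbed distribution, the same formula converges to a different vector, which is the bias.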

1.2 The bug: hybrid engines make it off-policy