Chenyu Yang, Zixuan Wang*, Tianchen Zhao, Jin Wang, Yuntao Chen, MiroMind RL Team, 2026

*Tsinghua University (Work done during an internship at MiroMind RL team. This project is led by Chenyu Yang, advised by Yuntao Chen. Technical blog is written by Zixuan Wang)

<aside> 🔖 TL;DR: Deep-research agents run long-horizon, multi-turn rollouts, and in on-policy agentic RL, rollout decoding is the bottleneck: it dominates step wall-clock time, and within it, long-context decode scales super-linearly with context length. RC3 (Rollout Chunking with Context Compression) splits one long rollout into K short chunks; at each chunk boundary it keeps only the query, the first-step plan, and the last few reasoning turns, and discards all raw tool responses. The full context is restored only at inference. Trained this way, RC3 nearly matches full-sequence (128k) RL with a ~2× training speedup.

</aside>
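The chunk-boundary rule in the TL;DR can be sketched as follows. This is a minimal illustration, not the MiroRL implementation: the `Turn` type, the `compress_at_boundary` helper, the `keep_last` default, and the assumption that the first message is the query and the first assistant turn is the plan are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool" (assumed role names)
    content: str


def compress_at_boundary(history: List[Turn], keep_last: int = 3) -> List[Turn]:
    """Build the starting context for the next chunk (hypothetical helper).

    Keeps the query, the first-step plan, and the last `keep_last` reasoning
    turns; every raw tool response is dropped.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    query = history[0]                                  # assumed: first message is the query
    reasoning = [t for t in history if t.role == "assistant"]
    if not reasoning:
        return [query]
    plan = reasoning[0]                                 # assumed: first assistant turn is the plan
    recent = reasoning[1:][-keep_last:]                 # most recent reasoning, excluding the plan
    return [query, plan, *recent]


if __name__ == "__main__":
    history = [
        Turn("user", "Who founded X?"),
        Turn("assistant", "Plan: search, then verify."),
        Turn("tool", "<long search dump>"),
        Turn("assistant", "The dump suggests A; verify with a second source."),
        Turn("tool", "<long web page>"),
        Turn("assistant", "Second source confirms A."),
    ]
    for turn in compress_at_boundary(history, keep_last=2):
        print(turn.role, "|", turn.content)
```

The next chunk then starts decoding from this short context instead of the full history, which is what keeps the per-chunk decode cost low.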

Screenshot 2026-06-24 at 2.44.38 PM.png

Figure 1. Agent loop for reinforcement learning of a ReAct-style [1] deep-search agent under a synchronous RL training framework (using the MiroRL framework).

Why this matters

On-policy RL provides a direct way to push deep-research agents beyond their SFT checkpoints. During training, the agent policy interacts with the environment, explores alternative strategies, and updates from its own multi-turn ReAct-style trajectories (“think → action → observation”), using verified final answers as reward signals (GRPO [2]). We observe a clear interaction scaling effect: the RL stage substantially increases the number of tool-use interactions per episode on deep-research tasks (e.g., GAIA-103 [3] and BrowseComp-en [4]), with some rollouts extending beyond 256K tokens of context. However, these long-horizon rollouts make large-scale RL experiments extremely expensive, which slows the validation of new algorithmic ideas.

Screenshot 2026-06-28 at 5.04.11 PM.png

Figure 2. Illustration of interactive scaling. Reinforcement learning training leads to a substantial increase in the number and depth of agent–environment interactions, resulting in consistently improved task performance across benchmarks. All results are from MiroThinker-v1.0-30B. This figure is from the MiroThinker-1.0 Technical Report.

The standard long-context toolbox is not a natural fit for the agentic rollout bottleneck. KV-cache compression and related long-context inference techniques primarily reduce the cost of prefill over long histories, while chunked SFT targets the weight updates. Generic RL accelerators, such as partial rollouts or speculative decoding, can improve throughput, but they do not directly address the core cost in agentic RL: during rollout, the model must repeatedly decode actions auto-regressively against an ever-growing interaction history that accumulates large volumes of tool results.

<aside> 👉🏻

Motivation: Where is the compute actually spent in agentic RL, and can we reduce it without changing the agent’s behavior?

</aside>

Preliminary Study

We first ran an inference-only analysis on the GAIA-103 benchmark to see how moving the inference-time context off the training distribution affects agent behavior.

| Model | ReAct | Keep-last-5* | IterResearch |
| --- | --- | --- | --- |
| Qwen3-14B (non-think) | 39.8 | n/a | 38.8 |
| Qwen3-30B-A3B-thinking | 31.0 | 29.6 | 36.9 |
| MiroThinker-30B-SFT | 72.0 | 73.5 | n/a |
| Qwen3-235B-A22B-thinking | 53.4 | n/a | 49.0 |

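Table 1. GAIA-103 results under different context-management strategies. *Keep-last-5: a recency-based context retention mechanism, where tool outputs from earlier turns are omitted to maintain context efficiency. IterResearch refers to “IterResearch: Rethinking Long-Horizon Agents with Interaction Scaling.”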
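A recency-based retention rule like Keep-last-5 can be sketched as below. This is an assumption-laden illustration: the message format, the `keep_last_k_tool_outputs` helper, the `OMITTED` placeholder text, and the choice to replace older tool outputs with a placeholder rather than delete them are all hypothetical, and we read "5" as the five most recent tool outputs.

```python
from typing import Dict, List

OMITTED = "[tool output omitted]"  # hypothetical placeholder text


def keep_last_k_tool_outputs(
    messages: List[Dict[str, str]], k: int = 5
) -> List[Dict[str, str]]:
    """Keep the k most recent tool outputs verbatim and blank out older ones.

    Reasoning and action messages are always kept, so the agent still sees
    its own full chain of decisions.
    """
    tool_positions = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    keep = set(tool_positions[-k:]) if k > 0 else set()
    compressed = []
    for i, message in enumerate(messages):
        if message["role"] == "tool" and i not in keep:
            compressed.append({"role": "tool", "content": OMITTED})
        else:
            compressed.append(dict(message))  # copy so the input is not mutated
    return compressed
```

Unlike the chunk-boundary rule in RC3, this variant keeps every reasoning turn and only trims tool outputs, so it changes the context less aggressively.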

Then, for each RL training step, we broke down wall-clock time by component while sweeping the context window from 8k to 128k on a 30B-A3B model. Tool/environment execution scales roughly linearly with trajectory length, so it is not the problem. LLM inference scales super-linearly: as context grows, the decode regime shifts from compute-bound to memory-bound, the KV cache balloons, and decode-step latency is dominated by reading it. At a 128k context, LLM inference is the dominant share of rollout time, and rollout in turn accounts for more than 90% of the training step.

Screenshot 2026-06-24 at 2.45.36 PM.png

Figure 2. (a) As the context window grows, rollout takes an increasing fraction of each training step. (b) Rollout time is increasingly dominated by long-context LLM decoding, while tool execution stays comparatively flat.
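The memory-bound argument can be made concrete with a back-of-the-envelope KV-cache calculation. The sketch below is illustrative only: the layer count, number of KV heads, head dimension, and 2-byte values are assumed numbers, not a measured configuration from our runs, and it assumes each decode step reads the entire cache once.

```python
def kv_cache_bytes(
    context_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumed 16-bit cache entries
) -> int:
    """Size of the KV cache for one sequence (keys and values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_tokens


def decode_kv_bytes_read(start_tokens: int, new_tokens: int, bytes_per_token: int) -> int:
    """Total cache bytes read while decoding `new_tokens` tokens.

    Step t (0-indexed) reads a cache of start_tokens + t tokens, so the sum
    grows quadratically in the number of generated tokens.
    """
    total_tokens_read = new_tokens * start_tokens + new_tokens * (new_tokens - 1) // 2
    return total_tokens_read * bytes_per_token


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(num_layers=48, num_kv_heads=4, head_dim=128)
    per_token = kv_cache_bytes(1, **shape)
    for ctx in (8 * 1024, 128 * 1024):
        gib = kv_cache_bytes(ctx, **shape) / 2**30
        print(f"context={ctx:>7} tokens  cache={gib:.2f} GiB per sequence")
    print("bytes per cached token:", per_token)
```

Under these assumptions the cache read per decode step grows linearly with context, and the total read over a whole rollout grows quadratically, which is consistent with the super-linear inference cost in the sweep.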

<aside> 👉🏻

Two Preliminary Findings

  1. The context is mostly disposable. Raw tool responses (web pages, search dumps) make up roughly 70% or more of the tokens and are full of noise, yet the agent’s next decision depends almost entirely on its recent reasoning. The useful content of a tool response has already been absorbed into the model’s own subsequent chain-of-thought.
  2. Rollout decode dominates the cost. A breakdown of training wall-clock time shows that rollout is the large majority of the step, and within rollout the cost is dominated by LLM decode, not tool/environment latency. </aside>

Our key question is: if the long context is expensive to decode but mostly redundant, can we shorten the context the policy decodes during training (without summarizing, adding tools, or teaching the model any new behavior) and restore the full context only at inference? Or does on-policy RL fundamentally need to train on exactly the trajectories it will be tested on?