Chenyu Yang, Zixuan Wang*, Tianchen Zhao, Jin Wang, Yuntao Chen, MiroMind RL Team, 2026
*Tsinghua University (Work done during an internship at MiroMind RL team. This project is led by Chenyu Yang, advised by Yuntao Chen. Technical blog is written by Zixuan Wang)
<aside> 🔖 TL;DR: Deep-research agents run long-horizon, multi-turn rollouts, and in on-policy agentic RL, rollout decoding is the bottleneck: it dominates step wall-clock time, and within it, long-context decoding scales super-linearly with context length. RC3 (Rollout Chunking with Context Compression) splits one long rollout into K short chunks; at each chunk boundary it keeps only the query, the first-step plan, and the last few reasoning turns, and discards all raw tool responses. The full context is restored only at inference. Trained this way, RC3 nearly matches full-sequence (128k) RL while training roughly 2× faster.
</aside>

Figure 1. Agent loop for reinforcement learning of a ReAct-style [1] deep-search agent under a synchronous RL training framework (using the MiroRL framework).
On-policy RL provides a direct way to push deep-research agents beyond their SFT checkpoints. During training, the agent policy interacts with the environment, explores alternative strategies, and updates from its own multi-turn ReAct-style trajectories (think → action → observation), using verified final answers as reward signals (GRPO [2]). We observe a clear interaction scaling effect: the RL stage substantially increases the number of tool-use interactions in episodes of deep-research tasks (e.g., GAIA-103 [3] and BrowseComp-en [4]), with some rollouts extending beyond 256K tokens of context. However, these long-horizon rollouts make large-scale RL experiments extremely expensive, which slows down the validation of new algorithmic ideas.

Figure 2. Illustration of interactive scaling. Reinforcement learning leads to a substantial increase in the number and depth of agent–environment interactions, resulting in consistently improved task performance across benchmarks. All results are from MiroThinker-v1.0-30B. This figure is from the MiroThinker-1.0 Technical Report.
The standard long-context toolbox is not a natural fit for the agentic rollout bottleneck. KV-cache compression and related long-context inference techniques primarily reduce the cost of prefill over long histories, while chunked SFT targets the weight updates. Generic RL accelerators, such as partial rollouts or speculative decoding, can improve throughput, but they do not directly address the core cost in agentic RL: during rollout, the model must repeatedly decode actions autoregressively against an ever-growing interaction history that accumulates large volumes of tool results.
<aside> 👉🏻
Motivation: Where is the compute actually spent in agentic RL, and can we reduce it without changing the agent’s behavior?
</aside>
We first ran an inference-only analysis on the GAIA-103 benchmark to see how an off-distribution inference context affects agent behavior.
| Model | ReAct | Keep-last-5* | IterResearch |
|---|---|---|---|
| Qwen3-14B (non-think) | 39.8 | n/a | 38.8 |
| Qwen3-30B-A3B-thinking | 31.0 | 29.6 | 36.9 |
| MiroThinker-30B-SFT | 72.0 | 73.5 | n/a |
| Qwen3-235B-A22B-thinking | 53.4 | n/a | 49.0 |
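Table 1. Inference-only results on GAIA-103 under different context-management strategies. *Keep-last-5: a recency-based context retention mechanism, where tool outputs from earlier turns are omitted to maintain context efficiency. IterResearch refers to the method from "IterResearch: Rethinking Long-Horizon Agents with Interaction Scaling".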
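The recency-based retention idea behind the Keep-last-5 column can be sketched in a few lines. This is a minimal illustration and not the evaluation code used for Table 1: the `Turn` type, the `OMITTED` placeholder, and the reading of "last 5" as "the five most recent turns keep their tool outputs" are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Turn:
    """One ReAct-style turn (hypothetical container for this sketch)."""
    thought: str       # model reasoning for this turn
    action: str        # tool call issued by the model
    observation: str   # raw tool response


# Assumed placeholder; the source does not say what replaces an omitted output.
OMITTED = "[tool output omitted]"


def keep_last_k(turns: List[Turn], k: int = 5) -> List[Turn]:
    """Return a copy of `turns` in which only the last `k` turns keep
    their raw tool output. Thoughts and actions are never dropped."""
    if k < 0:
        raise ValueError("k must be non-negative")
    cutoff = max(len(turns) - k, 0)  # turns before this index lose their output
    return [
        replace(turn, observation=OMITTED) if i < cutoff else turn
        for i, turn in enumerate(turns)
    ]


history = [Turn(f"think {i}", f"search {i}", f"result {i}") for i in range(8)]
compressed = keep_last_k(history, k=5)
```

With eight turns and `k=5`, the first three observations are replaced by the placeholder while every thought and action is preserved, so the model still sees its own reasoning chain but not the older raw tool results.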
Then, for each RL training step, we broke down the wall-clock time by component while sweeping the context window from 8k to 128k on a 30B-A3B model. Tool/environment execution scales roughly linearly with trajectory length, so it is not the problem. LLM inference scales super-linearly, because as context grows the decode regime shifts from compute-bound to memory-bound: the KV cache balloons and decode-step latency is dominated by reading it. At a 128k context, LLM inference accounts for the dominant share of rollout time, and rollout in turn accounts for more than 90% of the training step.

Figure 3. (a) As the context window grows, rollout takes an increasing fraction of each training step. (b) Rollout time is increasingly dominated by long-context LLM decoding, while tool execution stays comparatively flat.
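To make the memory-bound argument concrete, here is a back-of-the-envelope estimate of how much KV cache a single sequence carries at different context lengths. The layer count, KV-head count, head dimension, and 2-byte precision below are illustrative assumptions for a grouped-query-attention model of roughly this scale; they are not taken from the experiments above.

```python
def kv_cache_bytes(
    context_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumed 16-bit cache precision
) -> int:
    """Bytes of KV cache held for one sequence of `context_tokens` tokens.

    Each layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Assumed, illustrative model configuration (not from the source).
ASSUMED_CONFIG = dict(num_layers=48, num_kv_heads=4, head_dim=128)

for ctx in (8 * 1024, 32 * 1024, 128 * 1024):
    gib = kv_cache_bytes(ctx, **ASSUMED_CONFIG) / 2**30
    print(f"{ctx // 1024:>4}k tokens -> {gib:.2f} GiB of KV cache per sequence")
```

Under these assumptions the cache grows from 0.75 GiB at 8k tokens to 12 GiB at 128k tokens per sequence. Every decoded token has to read the cache accumulated so far, so the cost per decode step grows with context length and the total decode cost over a trajectory grows faster than linearly.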
<aside> 👉🏻
Two Preliminary Findings
Our key question is: if the long context is expensive to decode but mostly redundant, can we shorten the context the policy decodes during training, without summarizing, adding tools, or teaching the model any new behavior, and restore the full context only at inference? Or does on-policy RL fundamentally need to train on exactly the trajectories it will be tested on?