ZIXUAN WANG*, 2026
<aside> 📌
This is a personal technical writeup of work done during an internship on the MiroMind RL team, written for personal learning only.
</aside>

Figure 1: MiroVerse data engine. Two synthesis tracks (curated public datasets passed through a quality filter and a verifiability check; raw data lifted into a concept graph and expanded by a data engine) converge on an agent whose trajectories are kept only when they pass a verifier.
<aside> 🧠
TL;DR
A deep‑research agent's capability ceiling is set by the questions it trains on, not just by the RL algorithm. The hard, under‑appreciated problem is synthesizing multi‑hop questions that are simultaneously (i) genuinely multi‑hop (no single‑retrieval shortcut); (ii) verifiable (a unique, checkable gold answer); (iii) difficulty‑controllable; and (iv) diverse and cheaply scalable.
</aside>
This blog is about data synthesis, so the contrast that matters is between the structure the synthesizer controls at generation time and the surface the agent sees at solve time.
| color | meaning | objects |
|---|---|---|
| $\textcolor{blue}{\text{blue}}$ | latent formal structure the synthesizer controls at generation time | graph $\textcolor{blue}{\mathcal{G}}$, projection $\textcolor{blue}{\mathcal{K}}$, expression $\textcolor{blue}{Q}$, gold set $\textcolor{blue}{A^\star}$ |
| $\textcolor{red}{\text{red}}$ | surface objects the agent sees or produces at solve time | question $\textcolor{red}{q}$, trajectory $\textcolor{red}{\tau}$, prediction $\textcolor{red}{\hat{a}}$ |
| $\textcolor{green}{\text{green}}$ | verification that certifies an item | uniqueness $\textcolor{green}{\lvert A^\star\rvert=1}$, answerability, no‑shortcut |

A good synthesis pipeline makes the $\textcolor{red}{\text{red}}$ question faithfully encode the $\textcolor{blue}{\text{blue}}$ structure, and the $\textcolor{green}{\text{green}}$ bridge certifies that it does. Difficulty is engineered in blue; capability is measured in red; the contract between them is green.

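To make the blue and green objects concrete, here is a minimal sketch. Everything in it is a hypothetical illustration rather than MiroMind's actual representation: the toy triples, the `build_index`, `evaluate`, and `is_unique` helpers, and the choice to encode an expression $Q$ as a seed entity plus a chain of relations. The projection $\mathcal{K}$ is not modeled.

```python
from collections import defaultdict

# Toy graph G as (head, relation, tail) triples. All entities are made up.
TRIPLES = [
    ("Alice", "advisor", "Bob"),
    ("Bob", "affiliation", "Univ X"),
    ("Univ X", "located_in", "City Y"),
    ("Alice", "coauthor", "Carol"),
    ("Alice", "coauthor", "Dave"),
]


def build_index(triples):
    """Index triples as (head, relation) -> set of tails."""
    index = defaultdict(set)
    for head, rel, tail in triples:
        index[(head, rel)].add(tail)
    return index


def evaluate(index, seed, relations):
    """Evaluate the expression Q = (seed, relations); the result is the gold set A*."""
    frontier = {seed}
    for rel in relations:
        # Follow one relation from every entity currently in the frontier.
        frontier = {t for h in frontier for t in index.get((h, rel), set())}
    return frontier


def is_unique(gold):
    """Green check (assumed form): the item is verifiable only if |A*| == 1."""
    return len(gold) == 1


index = build_index(TRIPLES)

# A three-hop expression with a single gold answer.
gold = evaluate(index, "Alice", ["advisor", "affiliation", "located_in"])
print(gold, is_unique(gold))

# A one-hop expression with two answers fails the uniqueness check.
ambiguous = evaluate(index, "Alice", ["coauthor"])
print(sorted(ambiguous), is_unique(ambiguous))
```

The point of the sketch is the ordering: the gold set is computed from the latent structure before any question text exists, so uniqueness can be checked mechanically instead of being hoped for.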
Open deep‑research agents have scaled along two visible axes: model size and context length. MiroMind's tech report (arXiv:2511.11793) names a third, interaction depth: up to ~600 tool calls inside a 256K‑token context, with accuracy rising monotonically as the tool‑call budget grows (their Fig. 5). But all three axes are bottlenecked by the same thing: the questions in the training set. A question solvable by one search teaches a single retrieval, no matter how long the context or how many GRPO steps you run on it; a shortcut‑solvable question never teaches a 40‑hop search. So the real lever is the data‑generation process.

Concretely, MiroMind's data engine (Figure 1) runs two synthesis tracks. One passes curated public data through a quality filter and a verifiability check; the other lifts raw data into a concept graph and expands it with a data engine. Both feed an agent that runs full trajectories and keeps only those a Success/Fail verifier accepts. The resulting traces are overwhelmingly web‑grounded: the three web tools (Google Search, Web Scraping, Search & Browse) account for about 89.8% of all tool calls. The shares are read off Figure 1's pie chart, which displays percentages; the integer counts below are derived against the 602,179 total tool calls:
| tool | calls | share |
|---|---|---|
| Google Search | 251,102 | 41.7% |
| Web Scraping | 177,053 | 29.4% |
| Search & Browse | 112,485 | 18.7% |
| Python Code | 43,124 | 7.2% |
| Create Sandbox | 11,662 | 1.9% |
| Run Command | 1,889 | 0.3% |
| Others | 4,864 | 0.8% |
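As a sanity check on the table, the shares can be recomputed from the listed counts. The snippet uses only the numbers above; the variable names and the grouping of the first three tools as "web tools" are my own.

```python
# Tool-call counts exactly as listed in the table above.
calls = {
    "Google Search": 251_102,
    "Web Scraping": 177_053,
    "Search & Browse": 112_485,
    "Python Code": 43_124,
    "Create Sandbox": 11_662,
    "Run Command": 1_889,
    "Others": 4_864,
}

total = sum(calls.values())

# Percentage share per tool, rounded to one decimal like the table.
shares = {tool: round(100 * n / total, 1) for tool, n in calls.items()}

# Combined share of the three web-facing tools (grouping is an assumption).
web_tools = ("Google Search", "Web Scraping", "Search & Browse")
web_share = round(100 * sum(calls[t] for t in web_tools) / total, 1)

print(total)
for tool, pct in shares.items():
    print(f"{tool}: {pct}%")
print(f"web tools combined: {web_share}%")
```

The counts sum to the stated 602,179 total, and the recomputed shares match the table row by row.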
<aside>
Principle. Properties (i)–(iv) are jointly achievable only when you fix the latent structure first (blue) and certify the item with a verifier (green) — then let the surface text (red) be a faithful encoding of that structure. Text‑first generation hopes the structure emerges; structure‑first guarantees it.
</aside>
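A structure‑first loop can be sketched in a few lines. This is an illustration under loud assumptions, not the pipeline from the report: the toy graph, the helper names (`follow`, `min_hops`, `certify`, `render`), the definition of "shortcut" as any path to the answer shorter than the intended chain, and the fixed template standing in for a real text generator are all hypothetical.

```python
from collections import deque

# Hypothetical toy graph: (head, relation) -> set of tails. All names are made up.
EDGES = {
    ("Alice", "advisor"): {"Bob"},
    ("Carol", "advisor"): {"Bob"},
    ("Bob", "affiliation"): {"Univ X"},
    ("Univ X", "located_in"): {"City Y"},
    ("Alice", "born_in"): {"City Y"},  # direct edge: a one-hop shortcut to the answer
}


def follow(edges, seed, relations):
    """Blue: evaluate the relation chain from the seed and return the gold set."""
    frontier = {seed}
    for rel in relations:
        frontier = {t for h in frontier for t in edges.get((h, rel), ())}
    return frontier


def min_hops(edges, seed, target):
    """Breadth-first search: fewest edges from seed to target, or None if unreachable."""
    seen, queue = {seed}, deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        for (head, _rel), tails in edges.items():
            if head != node:
                continue
            for tail in tails:
                if tail not in seen:
                    seen.add(tail)
                    queue.append((tail, depth + 1))
    return None


def certify(edges, seed, relations):
    """Green: return the answer only if it is unique and has no shorter path."""
    gold = follow(edges, seed, relations)
    if len(gold) != 1:
        return None  # unanswerable or ambiguous
    (answer,) = gold
    if min_hops(edges, seed, answer) < len(relations):
        return None  # a shortcut exists, so the item is rejected
    return answer


def render(seed, relations):
    """Red: surface question from a fixed template (stand-in for a text generator)."""
    phrase = seed
    for rel in relations:
        phrase = f"the {rel} of {phrase}"
    return f"What is {phrase}?"


path = ["advisor", "affiliation", "located_in"]  # hop count = difficulty knob
for seed in ("Alice", "Carol"):
    answer = certify(EDGES, seed, path)
    if answer is None:
        print(seed, "rejected")
    else:
        print(render(seed, path), "->", answer)
```

In this toy setup the same three‑hop chain is rejected for Alice, because her `born_in` edge reaches the answer in one hop, and accepted for Carol. Difficulty is set by the length of `path` before any text is written, which is the property text‑first generation cannot guarantee.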
The naive baseline shows an LLM a passage and asks it to "write a hard multi‑hop question." It controls surface fluency and topical relevance, and essentially nothing about structure or answer.