Multi-hop QA Synthesis for Large-Scale Deep Research Agents

ZIXUAN WANG*, 2026

<aside> 📌

This is a personal technical writeup, based on work done during an internship on the MiroMind RL team. It is intended for personal learning only.

</aside>

Figure 1: MiroVerse data engine. Two synthesis tracks: curated public datasets passed through a quality filter and a verifiability check, and raw data lifted into a concept graph and expanded by a data engine, converge on an agent whose trajectories are kept only when they pass a verifier.


<aside> 🧭

TL;DR

A deep‑research agent's capability ceiling is set by the questions it trains on, not just by the RL algorithm. The hard, under‑appreciated problem is synthesizing multi‑hop questions that are simultaneously (i) genuinely multi‑hop (no single‑retrieval shortcut), (ii) verifiable (a unique, checkable gold answer), (iii) difficulty‑controllable, and (iv) diverse and cheaply scalable.

Pillar 1: Formalization‑driven synthesis (WebShaper‑style): model the task as a formal set‑algebraic expression $\textcolor{blue}{Q}$ over Knowledge Projections on an entity–relation graph $\textcolor{blue}{\mathcal{G}}$, then realize it into language so the required reasoning is isomorphic to the intended structure.
Pillar 2: Difficulty as uncertainty (WebSailor‑style): build a dense graph by random walks, then obfuscate entity references (replace a name with a definite description) to raise the agent's answer uncertainty.

</aside>
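
To make the two pillars concrete, here is a minimal Python sketch on a made‑up toy graph. The `project` helper is one simplified reading of a Knowledge Projection (the set of entities that stand in relation `rel` to some entity in `targets`). The entity names, relation names, and the `obfuscate` helper with its template argument are hypothetical illustrations, not WebShaper's or WebSailor's actual code.

```python
# Toy entity-relation graph as (head, relation, tail) triples.
# Every entity and relation here is invented for illustration.
TRIPLES = {
    ("alice", "born_in", "1990"),
    ("bob", "born_in", "1990"),
    ("carol", "born_in", "1985"),
    ("alice", "plays_for", "red_club"),
    ("carol", "plays_for", "red_club"),
    ("bob", "plays_for", "blue_club"),
    ("red_club", "founded_in", "1901"),
    ("blue_club", "founded_in", "1923"),
}


def project(rel, targets):
    """Simplified Knowledge Projection: heads linked via `rel` to any target."""
    return {h for (h, r, t) in TRIPLES if r == rel and t in targets}


def obfuscate(entity, rel, template):
    """Swap an entity name for a definite description (assumed template)."""
    tails = sorted(t for (h, r, t) in TRIPLES if h == entity and r == rel)
    return template.format(tails[0])


# Pillar 1: the formal expression Q = born_in({1990}) ∩ plays_for({red_club}).
gold = project("born_in", {"1990"}) & project("plays_for", {"red_club"})

# Pillar 2: hide "red_club" behind a description, which adds one more hop.
club_desc = obfuscate("red_club", "founded_in", "the club founded in {}")
question = f"Who was born in 1990 and plays for {club_desc}?"

# The description must itself pin down exactly one entity in the graph.
description_is_unique = project("founded_in", {"1901"}) == {"red_club"}
```

In this toy graph the gold set contains a single entity, and the rendered question no longer mentions `red_club` by name, so the solver has to resolve the description before it can apply the second constraint.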

Color convention

This blog is about data synthesis, so the contrast that matters is between the structure the synthesizer controls at generation time and the surface the agent sees at solve time.

color	meaning	objects
$\textcolor{blue}{\text{blue}}$	latent formal structure the synthesizer controls at generation time	graph $\textcolor{blue}{\mathcal{G}}$, projection $\textcolor{blue}{\mathcal{K}}$, expression $\textcolor{blue}{Q}$, gold set $\textcolor{blue}{A^\star}$
$\textcolor{red}{\text{red}}$	surface objects	question $\textcolor{red}{q}$, trajectory $\textcolor{red}{\tau}$, prediction $\textcolor{red}{\hat{a}}$
$\textcolor{green}{\text{green}}$	verification that certifies an item	uniqueness $\textcolor{green}{\lvert A^\star\rvert=1}$, answerability, no‑shortcut

A good synthesis pipeline makes the $\textcolor{red}{\text{red}}$ question faithfully encode the $\textcolor{blue}{\text{blue}}$ structure, and the $\textcolor{green}{\text{green}}$ bridge certifies that it does. Difficulty is engineered in blue; capability is measured in red; the contract between them is green.

1 · Why QA data matters

Open deep‑research agents have scaled along two visible axes: model size and context length. MiroMind's tech report (arXiv:2511.11793) names a third, interaction depth: up to ~600 tool calls inside a 256K‑token context, with accuracy rising monotonically as the tool‑call budget grows (their Fig. 5). But all three axes are bottlenecked by the same thing: the questions in the training set. A question solvable by one search teaches a single retrieval, no matter how long the context or how many GRPO steps you run on it, and a shortcut‑solvable question never teaches a 40‑hop search. So the real lever is the data‑generation process.

Concretely, MiroMind's data engine (Figure 1) has two synthesis tracks: curated public data passed through a quality filter and a verifiability check, and raw data lifted into a concept graph and expanded by a data engine. Both tracks feed an agent that runs full trajectories and keeps only the ones a Success/Fail verifier accepts. The resulting traces are overwhelmingly web‑grounded. The shares below are read off Figure 1's pie chart, which displays percentages; the integer counts are derived against the 602,179 total tool calls:

tool	calls	share
Google Search	251,102	41.7%
Web Scraping	177,053	29.4%
Search & Browse	112,485	18.7%
Python Code	43,124	7.2%
Create Sandbox	11,662	1.9%
Run Command	1,889	0.3%
Others	4,864	0.8%
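
As a sanity check, the shares can be recomputed from the counts. The snippet below uses only the numbers in the table; the variable names and the grouping of the first three rows as "web‑facing" are my own assumptions.

```python
# Tool-call counts copied from the table above.
CALLS = {
    "Google Search": 251_102,
    "Web Scraping": 177_053,
    "Search & Browse": 112_485,
    "Python Code": 43_124,
    "Create Sandbox": 11_662,
    "Run Command": 1_889,
    "Others": 4_864,
}

total = sum(CALLS.values())

# Share of each tool, as a percentage rounded to one decimal place.
shares = {tool: round(100 * n / total, 1) for tool, n in CALLS.items()}

# Assumed grouping: the first three rows are the web-facing tools.
web_tools = ("Google Search", "Web Scraping", "Search & Browse")
web_share = 100 * sum(CALLS[t] for t in web_tools) / total

for tool, pct in shares.items():
    print(f"{tool:16s} {CALLS[tool]:>8,d} {pct:5.1f}%")
print(f"total {total:,d}; web-facing share {web_share:.1f}%")
```

The counts sum to the stated total, and under this grouping search, scraping, and browsing together account for roughly nine in ten tool calls.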

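A useful training question has to satisfy four properties at once:

(i) Genuinely multi‑hop: no single‑retrieval shortcut.
(ii) Verifiable: a unique, checkable gold answer.
(iii) Difficulty‑controllable.
(iv) Diverse and cheaply scalable.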

<aside>

Principle. Properties (i)–(iv) are jointly achievable only when you fix the latent structure first (blue), certify the item with a verifier (green), and then let the surface text (red) be a faithful encoding of that structure. Text‑first generation hopes the structure emerges; structure‑first generation guarantees it.

</aside>
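
One way to read the principle as code is sketched below, under stated assumptions: the `Item` record and `certify` function are hypothetical, each constraint is represented directly by the set of entities that satisfy it (standing in for an evaluated projection), and the no‑shortcut test is a crude proxy that only asks whether a single constraint already pins down the answer. A real pipeline would check shortcuts against an actual retriever.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Blue: latent structure, fixed before any text is written.
    constraints: dict  # constraint name -> set of entities satisfying it
    # Red: surface question rendered from that structure.
    question: str

    @property
    def gold(self):
        """Gold set A*: the entities that satisfy every constraint."""
        sets = list(self.constraints.values())
        return set.intersection(*sets) if sets else set()


def certify(item):
    """Green checks: answerable, unique, and no single-constraint shortcut."""
    gold = item.gold
    answerable = len(gold) >= 1
    unique = len(gold) == 1
    # Proxy: a shortcut exists if one constraint alone identifies one entity.
    no_shortcut = all(len(s) > 1 for s in item.constraints.values())
    return {
        "answerable": answerable,
        "unique": unique,
        "no_shortcut": no_shortcut,
        "accepted": unique and no_shortcut,
    }


# Invented toy items: one clean, one with a shortcut, one ambiguous.
good = Item(
    {"born_1990": {"alice", "bob"}, "plays_red": {"alice", "carol"}},
    "Who was born in 1990 and plays for the red club?",
)
shortcut = Item(
    {"born_1990": {"alice", "bob"}, "won_cup_2015": {"alice"}},
    "Who was born in 1990 and won the cup in 2015?",
)
ambiguous = Item(
    {"born_1990": {"alice", "bob"}, "is_player": {"alice", "bob", "carol"}},
    "Which player was born in 1990?",
)

results = {name: certify(it)["accepted"]
           for name, it in [("good", good), ("shortcut", shortcut),
                            ("ambiguous", ambiguous)]}
```

Only the first item is accepted: the second is unique but solvable from one constraint, and the third has two valid answers. The question text plays no role in certification here, which is the point of fixing structure before surface.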

1.1 · The villain: text‑first / passage‑to‑question

The naive baseline shows an LLM a passage and asks it to "write a hard multi‑hop question." It controls surface fluency and topical relevance, and essentially nothing about structure or answer.

Structure ≠ reasoning mismatch. WebShaper's central complaint is the "inconsistency between information structure and reasoning structure, question and answer." The passage's information layout need not match the reasoning the question actually demands.
One‑search shortcuts. The gold answer often sits a single retrieval away: the "multi‑hop" question collapses to one hop.