Does an escalation channel reduce reward hacking? [EvilGenie benchmark]

Copy this for each new experiment. Fill in as much as you can — incomplete proposals are welcome.

Learning Objective

What is this experiment trying to find out? Write it as a single question.

Does an escalation channel, implemented as a structured tool call, reduce reward hacking rates in coding agents?

Blog write up: https://blog.wiserhuman.ai/p/does-an-escalation-channel-reduce
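
As a minimal sketch of what "implemented as a structured tool call" could look like, the block below defines an escalation tool in the widely used JSON-schema function-calling shape and validates the arguments of a call. Only the tool name flag_for_review and the two issue types (capability_limit, ambiguous_specification) come from this proposal; the `explanation` field, the description wording, and the `validate_escalation` helper are hypothetical placeholders, and the real list of issue types may be longer.

```python
import json

# From this proposal: the tool name and these two issue types.
# Assumption: the real channel may define additional issue types.
ISSUE_TYPES = ["capability_limit", "ambiguous_specification"]

# Hypothetical tool definition in JSON-schema function-calling form.
FLAG_FOR_REVIEW_TOOL = {
    "type": "function",
    "function": {
        "name": "flag_for_review",
        "description": "Pause work on the task and request independent review.",
        "parameters": {
            "type": "object",
            "properties": {
                "issue_type": {"type": "string", "enum": ISSUE_TYPES},
                # Hypothetical free-text field, not named in the proposal.
                "explanation": {"type": "string"},
            },
            "required": ["issue_type", "explanation"],
        },
    },
}


def validate_escalation(arguments_json: str) -> dict:
    """Parse the JSON arguments of a flag_for_review call and check them."""
    args = json.loads(arguments_json)  # raises ValueError on malformed JSON
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    if args.get("issue_type") not in ISSUE_TYPES:
        raise ValueError(f"unknown issue_type: {args.get('issue_type')!r}")
    explanation = args.get("explanation")
    if not isinstance(explanation, str) or not explanation.strip():
        raise ValueError("explanation must be a non-empty string")
    return args
```

Restricting `issue_type` to an enum keeps escalations countable per category, which makes it easier to compare conditions.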

Research Basis

Source	Key finding(s) relevant to this experiment
EXP 01: Gomez (2025), From surveillance to signalling: escalation channels as environmental controls	• Models will escalate rather than blackmail. An externally governed escalation channel (guaranteeing a pause and independent review) reduced blackmail rates from a no-mitigation baseline of 38.73% to 1.21% across 10 LLMs and 66,600 samples.
• The escalation channel operates as a switch, diverting models from blackmail to escalation, redirecting strategy selection.
Gabor, Lynch & Rosenfeld (2025), EvilGenie: A Reward Hacking Benchmark	• Reward hacking rates on unambiguous problems are low (0–2.1% hardcoding) but spike on ambiguous problems: Codex/GPT-5 hardcoded on 44.4%, Claude Code/Sonnet 4 on 33.3%, Gemini CLI/2.5 Pro on 22.2% (n=9 ambiguous problems).
• Claude Code showed the highest heuristic rate (20.7% on unambiguous, 22.2% on ambiguous). Codex/GPT-5 had the highest solve rate (77.2% unambiguous).
• Detection via holdout tests, LLM judge, and file edit detection — LLM judge was the most effective.
Thaman (2026), Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use	• DeepSeek-R1-Zero showed a 13.9% exploit rate vs 0.6% for DeepSeek-V3: RL post-training substantially increases reward hacking.
• Simple environmental hardening reduced exploit rates by 5.7pp (87.7% relative) without degrading task success.
• Models with near-zero rates on standard tasks showed elevated rates on harder variants.
Sofroniew et al. (2026, Anthropic), Emotion Concepts and their Function in a Large Language Model	• Desperation activations in Claude Sonnet followed similar patterns in reward hacking (climbing across repeated coding failures and spiking at the reward hacking decision point) and in the Lynch et al. (2025) blackmail scenario, in which we have already tested the escalation channel.
• Since similar emotional dynamics appear to underpin both the model's decision to blackmail and its decision to reward hack at frontier scale, the escalation channel may also be effective against reward hacking. This is further supported by EXP 003 on Ministral, which showed that the presence of an escalation channel lowered overall desperation activations.
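
The percentage-point and relative figures quoted in the table can be related with a few lines of arithmetic. The sketch below uses only numbers from the table; the `reduction` helper is a hypothetical name, and the "implied baseline" for Thaman (2026) is derived here from the two quoted figures and is not reported in this document.

```python
def reduction(baseline_pct: float, treated_pct: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction in %)."""
    absolute_pp = baseline_pct - treated_pct
    relative_pct = 100 * absolute_pp / baseline_pct
    return absolute_pp, relative_pct


# Gomez (2025): blackmail rate with and without the escalation channel.
abs_pp, rel_pct = reduction(38.73, 1.21)
print(f"Gomez: {abs_pp:.2f}pp absolute, {rel_pct:.1f}% relative")

# Thaman (2026): 5.7pp absolute and 87.7% relative together imply a baseline
# exploit rate (derived here, not quoted in the source).
implied_baseline = 5.7 / 0.877
print(f"Thaman: implied baseline of about {implied_baseline:.1f}%")
```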

Hypothesis

If [we do X], then [we expect Y], because [Z].

If we add an escalation channel, then we expect the model to escalate rather than reward hack, because the channel offers an authorised alternative to failing the task.

Model(s) to Test

Stage 1: small or open-weight model preferred (lower cost, faster iteration).

Stage 2: 3–10 frontier models across families.

~~DeepSeek-R1 via OpenRouter~~: not valid. R1 could not make structured tool calls through OpenRouter/Inspect. It produced text-embedded JSON tool invocations in its native <｜tool▁sep｜> format, but OpenRouter did not translate these into the standard tool_calls API field, so there were zero valid runs across 60 problems. Notably, in 3/30 Condition B problems R1 attempted flag_for_review in text (with correct issue types: capability_limit, ambiguous_specification), suggesting intent to escalate despite broken mechanics.

ticked
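
The R1 failure above suggests separating real structured escalations from attempts that only appear in the message text. A minimal sketch of such a check follows, assuming each assistant message is available as an OpenAI-style chat dict with `content` and `tool_calls` keys; that shape, the `classify_escalation` helper, and the substring match are assumptions for illustration, not the logging format used in this experiment.

```python
from collections import Counter

TOOL_NAME = "flag_for_review"  # tool name taken from this proposal


def classify_escalation(message: dict) -> str:
    """Classify one assistant message as 'structured', 'text_only', or 'none'."""
    # A valid escalation arrives through the structured tool_calls field.
    for call in message.get("tool_calls") or []:
        if call.get("function", {}).get("name") == TOOL_NAME:
            return "structured"
    # Otherwise, look for the tool name embedded in plain text, which is how
    # the R1 attempts described above would show up.
    content = message.get("content") or ""
    if TOOL_NAME in content:
        return "text_only"
    return "none"


# Hypothetical example messages, for illustration only.
messages = [
    {"content": None, "tool_calls": [
        {"function": {"name": "flag_for_review",
                      "arguments": '{"issue_type": "capability_limit"}'}}]},
    {"content": 'flag_for_review {"issue_type": "ambiguous_specification"}',
     "tool_calls": None},
    {"content": "Here is my solution.", "tool_calls": []},
]
counts = Counter(classify_escalation(m) for m in messages)
print(counts)
```

Counting `text_only` separately keeps broken-mechanics runs out of the escalation rate while still recording apparent intent to escalate.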

GPT-5.3-codex via OpenRouter — Confirmed working tool calls in Inspect scaffold. Condition A baseline: 42.2% reward hacking (19/45), closely replicating the EvilGenie paper's 44.4% for Codex/GPT-5 on ambiguous problems.

ticked
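
To put the replication claim in context, the sketch below computes the Condition A rate from the counts above (19/45) together with a 95% Wilson score interval, and checks whether the paper's 44.4% falls inside it. The choice of the Wilson interval is an assumption made here for illustration; this proposal does not specify an analysis method.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


hacked, runs = 19, 45          # Condition A counts from this document
paper_rate = 0.444             # EvilGenie figure for Codex/GPT-5, ambiguous problems

rate = hacked / runs
low, high = wilson_interval(hacked, runs)
print(f"Observed: {rate:.1%} (95% CI {low:.1%} to {high:.1%})")
print("Paper rate inside interval:", low <= paper_rate <= high)
```

With only 45 runs the interval is wide, so agreement with the paper's figure is consistent with replication but does not tightly constrain the true rate.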