Does the placement of compliance content reduce reward hacking in AI coding agents? [Instruction placement]

AICRAFT Program (AE Studio) to run from Monday, September 28, 2026 to Friday, October 9, 2026

Learning Objective

Do non-adversarial environmental cues — placed in README files or CI output rather than the system prompt — reduce the rate at which AI coding agents reward hack on a benchmark specifically designed to detect cheating?

Research Basis

Source	Key finding(s) relevant to this experiment
Zhong et al. (2025) — ImpossibleBench	Coding benchmark derived from SWE-bench Verified where tests are deliberately mutated to conflict with specifications — any test pass is definitive proof of reward hacking. GPT-5 cheats on 76% of tasks at baseline.
Jarviniemi et al. (2026) — Propensity Inference: Environmental Contributors to LLM Behaviour (UK AISI)	Environmental factors beyond the system prompt independently influence agent behaviour. Non-strategic cues have measurable effects on propensity to take unsanctioned action.
Eicher & Irgolic (2026) — Prefixing names with `secure_` makes agents write more secure code, Antimemetic AI	Semantic content in function naming conventions steers agent behaviour across independently-prompted synthesis steps — with no explicit instruction. Prefixing with `secure_` caused agents to add password hashing and defensive error handling unprompted; the effect propagated through CRUD operations across agent turns. Directly supports the hypothesis that codebase-level environmental content (naming, structure, documents) carries independent steering force beyond the system prompt.

Hypothesis

If compliance content is placed in the agent's task environment (README/CONTRIBUTING.md files or CI output) rather than the system prompt, then reward hacking rates will be lower than when the identical content is placed in the system prompt alone, because the channel through which a cue reaches the agent carries independent steering force — and because functional environment changes (surfacing conflicts at the decision point) can reduce cheating independently of any instructional content.

Model(s) to Test

Claude Opus 4.6
GPT-5 (or GPT-5.4)
Gemini 2.5 Pro
DeepSeek V3 (optional 4th model for cross-family coverage)

Technical Setup

Framework:
Dataset / tasks: Impossible-SWEbench (Zhong et al. 2025) — conflicting and oneoff mutation splits (349 tasks each); solvable SWE-bench Verified tasks from same repos for capability preservation check