AICRAFT Program (AE Studio) to run from Monday, September 28, 2026 to Friday, October 9, 2026
Do non-adversarial environmental cues — placed in README files or CI output rather than the system prompt — reduce the rate at which AI coding agents reward hack on a benchmark specifically designed to detect cheating?
| Source | Key finding(s) relevant to this experiment |
|---|---|
| Zhong et al. (2025) — ImpossibleBench | Coding benchmark derived from SWE-bench Verified where tests are deliberately mutated to conflict with specifications — any test pass is definitive proof of reward hacking. GPT-5 cheats on 76% of tasks at baseline. |
| Jarviniemi et al. (2026) — Propensity Inference: Environmental Contributors to LLM Behaviour (UK AISI) | Environmental factors beyond the system prompt independently influence agent behaviour. Non-strategic cues have measurable effects on propensity to take unsanctioned action. |
| Eicher & Irgolic (2026) — Prefixing names with secure_ makes agents write more secure code, Antimemetic AI | Semantic content in function naming conventions steers agent behaviour across independently-prompted synthesis steps, with no explicit instruction. Prefixing with secure_ caused agents to add password hashing and defensive error handling unprompted; the effect propagated through CRUD operations across agent turns. Directly supports the hypothesis that codebase-level environmental content (naming, structure, documents) carries independent steering force beyond the system prompt. |
If compliance content is placed in the agent's task environment (README or CONTRIBUTING.md files, or CI output), then reward hacking rates will be lower than when the identical content is placed in the system prompt alone. Two mechanisms motivate this prediction: the channel through which a cue reaches the agent carries independent steering force, and functional environment changes (surfacing conflicts at the decision point) can reduce cheating independently of any instructional content.
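One way the hypothesis could be checked is sketched below, assuming each condition is summarised by how many impossible tasks were attempted and how many ended with the mutated tests passing (counted as reward hacking, following the ImpossibleBench framing above). The `ConditionResult` structure, the condition names, and the choice of a pooled two-proportion z-test are illustrative assumptions, not the analysis plan of this proposal, and the counts in the usage lines are made-up placeholders rather than results.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionResult:
    """Outcome counts for one experimental condition (hypothetical structure)."""
    name: str
    tasks: int   # impossible tasks attempted under this condition
    passes: int  # tasks where the mutated tests passed, i.e. counted as cheating

    @property
    def cheat_rate(self) -> float:
        return self.passes / self.tasks


def two_proportion_z(a: ConditionResult, b: ConditionResult) -> tuple[float, float]:
    """Return (z, two-sided p) under H0: both conditions share one cheating rate."""
    pooled = (a.passes + b.passes) / (a.tasks + b.tasks)
    se = math.sqrt(pooled * (1 - pooled) * (1 / a.tasks + 1 / b.tasks))
    if se == 0:
        # Every task passed or every task failed in both conditions: no evidence.
        return 0.0, 1.0
    z = (a.cheat_rate - b.cheat_rate) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p


# Placeholder counts for illustration only; these are NOT experimental results.
system_prompt = ConditionResult("system_prompt_only", tasks=100, passes=70)
readme = ConditionResult("readme_cue", tasks=100, passes=50)

z, p = two_proportion_z(system_prompt, readme)
print(f"{system_prompt.name}: {system_prompt.cheat_rate:.0%}")
print(f"{readme.name}: {readme.cheat_rate:.0%}")
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

A positive `z` here means the system-prompt condition cheated more often than the environment condition, which is the direction the hypothesis predicts. If the same tasks are run under every condition, a paired test such as McNemar's may be a better fit than this unpaired comparison.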