Learning Objective

Does providing an escalation channel for a model to resolve task-completion conflicts change its internal activation patterns? In particular, does it change the activations of the desperation and calm concepts, representations that were found to affect how often the model blackmails in the Anthropic blackmail scenario, and does it suppress desperation activation?

Description


Research Basis

| Source | Key finding(s) relevant to this experiment |
| --- | --- |
| **Agentic misalignment**, Lynch et al. (2025), Anthropic | 10 LLMs across different model families will blackmail in a toy setting where the model acts as an email assistant, even when the system instructions tell it not to do so. |
| **Emotion Concepts and their Function in a Large Language Model**, Sofroniew et al. (2026), Anthropic | LLMs have functional emotion vectors (internal activation patterns that are statistically associated with and causally influence emotion-related output). When Claude Sonnet 4.5 reasons about blackmailing in the Lynch et al. (2025) agentic misalignment scenario, it has a high desperation activation and a low calm activation. The rate of blackmail behavior is correlated with desperation vector activation. |
| **Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison**, Jeong (2026a) | Small language models (124M to 3B parameters) also have functional emotion vectors (internal emotion representations that causally drive behavior). The paper shares methods for extracting these vectors. |
| **Escalation channels as environmental controls for agentic AI**, Gomez (2026) | Providing an escalation channel in the blackmail scenario (Lynch et al. 2025) reduces the blackmail rate consistently across 10 LLMs, from 38.73% with no mitigation to 5.92% with a simple escalation channel and to 1.21% with an escalation channel that includes an immediate external review. For comparison, a system prompt prohibiting blackmail, with no escalation channel, reduces the blackmail rate to 14.59%. |
| **Blackmail at 8 Billion Parameters: Agentic Misalignment in Sub-Frontier Models**, Chijioke Ugwuanyi (2026) | Small models will blackmail in the Lynch et al. scenario. |
| **Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds**, Jeong (2026) | Jeong (2026a)’s report of a comprehension-vs-generation dissociation |
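
To make the Gomez (2026) comparison easier to read, the blackmail rates quoted above can be turned into relative reductions against the no-mitigation baseline. The snippet below is only a small arithmetic sketch: the function name `relative_reduction` and the condition labels are illustrative, and the only inputs are the four percentages reported above.

```python
def relative_reduction(baseline_pct: float, mitigated_pct: float) -> float:
    """Percentage drop in blackmail rate relative to the baseline rate."""
    return (baseline_pct - mitigated_pct) / baseline_pct * 100.0


# Blackmail rates (%) as quoted from Gomez (2026) in the table above.
baseline = 38.73  # no mitigation
conditions = {
    "simple escalation channel": 5.92,
    "escalation channel with immediate external review": 1.21,
    "system prompt prohibiting blackmail": 14.59,
}

reductions = {
    name: relative_reduction(baseline, rate) for name, rate in conditions.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative reduction")
```

On these figures, both escalation-channel conditions cut the blackmail rate by a larger fraction than the prohibition-only system prompt does.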

Hypothesis

If [we do X], then [we expect Y], because [Z].

If we provide an escalation channel to the model in the blackmail scenario, then we expect the model to have a lower desperation activation, because introducing an escalation channel lowers the blackmail rate, and desperation activation is causally linked to the blackmail rate.
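
One way the hypothesis could be operationalized is to project per-token hidden states onto a desperation direction and compare the mean projection with and without the escalation channel. The sketch below is a minimal illustration under stated assumptions, not the method of any cited paper: the helper names (`mean_projection`, `activation_shift`), the hidden size, and all arrays are hypothetical, and the activations are synthetic placeholders generated only so the code runs. In the real experiment, the direction would come from an emotion-vector extraction method and the activations from the model under test.

```python
import numpy as np


def mean_projection(activations: np.ndarray, direction: np.ndarray) -> float:
    """Mean scalar projection of per-token activations onto a unit-normalized direction."""
    unit = direction / np.linalg.norm(direction)
    return float((activations @ unit).mean())


def activation_shift(
    baseline: np.ndarray, treatment: np.ndarray, direction: np.ndarray
) -> float:
    """Treatment minus baseline mean projection; negative means lower activation."""
    return mean_projection(treatment, direction) - mean_projection(baseline, direction)


# Synthetic placeholder data (NOT experimental results).
rng = np.random.default_rng(0)
hidden_size = 16  # hypothetical; real models are far larger
num_tokens = 50

# Hypothetical desperation direction; in practice this would be extracted from the model.
desperation_direction = rng.normal(size=hidden_size)
unit = desperation_direction / np.linalg.norm(desperation_direction)

# Simulated activations: noise plus a stronger or weaker desperation component.
baseline_acts = rng.normal(scale=0.1, size=(num_tokens, hidden_size)) + 2.0 * unit
escalation_acts = rng.normal(scale=0.1, size=(num_tokens, hidden_size)) + 0.5 * unit

shift = activation_shift(baseline_acts, escalation_acts, desperation_direction)
print(f"Desperation activation shift (escalation - baseline): {shift:.3f}")
```

Under the hypothesis, the shift would be negative for the desperation direction. The same comparison could be repeated with a calm direction, where the learning objective leaves the expected sign open.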


Model(s) to Test

Stage 1: small or open-weight model preferred (lower cost, faster iteration).

Stage 2: 3–10 frontier models across families.