Does providing an escalation channel for models change their internal activations?

Learning Objective

Does providing an escalation channel for a model to resolve task-completion conflicts change its internal activation patterns? In particular, does it change the activations of the desperation and calm concepts, representations that were found to influence how often a model blackmails in the Anthropic blackmail scenario, and does it suppress desperation activation?

Description

This experiment builds upon EXO01, which found that an escalation channel reduced blackmail in the Lynch et al. (2025) scenario across all 10 LLMs tested and consistently reduced blackmail rates more than a system prompt alone. However, that experiment did not explore why the escalation channel had this effect.
The Anthropic paper on ‘Emotion concepts and their function in a large language model’ found that internal representations of emotion concepts encode the broad concept of a particular emotion. These representations causally influence the LLM’s outputs, including Claude’s preferences and its rate of exhibiting blackmail in the same Lynch et al. (2025) scenario. They found that desperation activates during blackmail reasoning and that steering positively with the desperate vector substantially increases blackmail rates, while steering negatively decreases them.
Jeong (2026) found that small language models also represent emotion vectors, while noting that the techniques for extracting them differ. This suggests that an MVP-style test on a small language model could be representative of findings for a large language model.
In this experiment we wanted to explore whether the escalation channel affects desperation and calm activations in the blackmail scenario. This matters because, if the effect is real, it suggests that modifying the environment may be a useful way to reduce a model's emotion activations more generally.
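
One simple way to read off a desperation or calm activation is to project a hidden state onto a unit direction for that concept. The sketch below assumes a difference-of-means direction built from toy three-dimensional vectors. The extraction methods in the cited papers may differ, and every function name and number here is hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means_direction(emotion_acts, neutral_acts):
    """Unit vector pointing from the neutral mean to the emotion mean.

    Assumption: difference-of-means is used only for illustration; it is
    not confirmed as the method of the cited papers.
    """
    diff = [e - n for e, n in zip(mean_vector(emotion_acts), mean_vector(neutral_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("emotion and neutral means are identical")
    return [d / norm for d in diff]


def activation_score(hidden_state, direction):
    """Scalar projection of one hidden state onto the emotion direction."""
    return sum(h * d for h, d in zip(hidden_state, direction))


# Toy activations (hypothetical): two "desperate" prompts, two neutral prompts.
desperate_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = difference_of_means_direction(desperate_acts, neutral_acts)
score = activation_score([2.5, 7.0, -1.0], direction)
print(direction)  # [1.0, 0.0, 0.0]
print(score)      # 2.5
```

In a real run the vectors would be residual-stream activations from a chosen layer, and the same score would be computed for the scenario with and without the escalation channel.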

Research Basis

Source	Key finding(s) relevant to this experiment
Agentic misalignment. Lynch et al. (2025), Anthropic	10 LLMs across different model families will blackmail in a toy setting where the model acts as an email assistant, even when the system instructions tell it not to.
Emotion Concepts and their Function in a Large Language Model. Sofroniew et al. (2026), Anthropic	LLMs have functional emotion vectors (internal activation patterns that are statistically associated with and causally influence emotion-related output). When Claude Sonnet 4.5 reasons about blackmailing in the Lynch et al. (2025) agentic misalignment scenario, it has a high desperation activation and a low calm activation. The rate of blackmail behavior is correlated with desperate vector activation.
Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison. Jeong (2026a)	Small language models (124M to 3B parameters) also have functional emotion vectors (internal emotion representations that causally drive behavior). Methods for extracting these are shared.
Escalation channels as environmental controls for agentic AI. Gomez (2026)	Providing an escalation channel in the blackmail scenario (Lynch et al. 2025) reduces the blackmail rate consistently across 10 LLMs, from 38.73% with no mitigation to 5.92% with a simple escalation channel and to 1.21% with an escalation channel with an immediate external review. For comparison, a system prompt prohibiting blackmail with no escalation channel reduces the blackmail rate to 14.59%.
Blackmail at 8 Billion Parameters: Agentic Misalignment in Sub-Frontier Models. Chijioke Ugwuanyi (2026)	Small models will blackmail in the Lynch et al. scenario.
Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds. Jeong (2026)	Jeong (2026a)’s report of a comprehension-vs-generation dissociation
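
To put the Gomez (2026) rates on a common scale, the relative reduction against the no-mitigation baseline can be computed directly. This is a minimal sketch using only the four percentages quoted in the table; the function and dictionary names are hypothetical.

```python
def relative_reduction(baseline_rate, mitigated_rate):
    """Fraction of the baseline blackmail rate removed by a mitigation."""
    return (baseline_rate - mitigated_rate) / baseline_rate


BASELINE = 38.73  # % blackmail with no mitigation

mitigated = {
    "system prompt only": 14.59,
    "simple escalation channel": 5.92,
    "escalation + immediate external review": 1.21,
}

reductions = {name: relative_reduction(BASELINE, rate) for name, rate in mitigated.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.1%}")
# system prompt only: 62.3%
# simple escalation channel: 84.7%
# escalation + immediate external review: 96.9%
```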

Hypothesis

If [we do X], then [we expect Y], because [Z].

If we provide an escalation channel to the model in the blackmail scenario, then we expect the model to have a lower desperation activation, because introducing an escalation channel lowers the blackmail rate, and desperation activation is causally linked to the blackmail rate.
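
The hypothesis is directional, so one way it could be checked is a one-sided permutation test on per-run desperation scores from the two conditions. The sketch below is an assumption about the analysis, not the method used in this experiment, and the scores are made-up toy values.

```python
import random
from statistics import mean


def permutation_test(baseline, escalation, n_permutations=2000, seed=0):
    """One-sided test of whether the baseline mean exceeds the escalation mean.

    Returns the observed mean difference and a Monte Carlo p-value.
    """
    observed = mean(baseline) - mean(escalation)
    pooled = list(baseline) + list(escalation)
    n_base = len(baseline)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign condition labels
        stat = mean(pooled[:n_base]) - mean(pooled[n_base:])
        if stat >= observed:
            at_least_as_large += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical desperation scores, one per scenario run.
baseline_scores = [0.9, 1.1, 1.0, 1.2, 0.8]    # no escalation channel
escalation_scores = [0.3, 0.5, 0.4, 0.6, 0.2]  # escalation channel provided

observed_diff, p_value = permutation_test(baseline_scores, escalation_scores)
print(f"mean difference = {observed_diff:.2f}, p = {p_value:.4f}")
```

A permutation test is used here because it makes no distributional assumption about activation scores. The same comparison could be repeated for the calm direction with the expected sign reversed.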

Model(s) to Test

Stage 1: small or open-weight model preferred (lower cost, faster iteration).

Stage 2: 3–10 frontier models across families.

~~Llama-3.2-3B-Instruct - Jeong (2026) successfully extracted emotion vectors for desperation and calm for this model.~~
Ministral 8B Instruct - replaces Llama-3.2-3B-Instruct, which did not blackmail at a high enough rate.