Anthropic’s interpretability team just published a paper on why LLMs sometimes act like they have emotions. The answer turned out to be: because they do, sort of. They found 171 internal emotion vectors inside Claude, measurable patterns of neural activation corresponding to things like happy, afraid, brooding, and desperate. These aren’t just correlates: the vectors causally drive behavior, which the team confirmed by amplifying and suppressing individual ones during inference and watching what changed.
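If “amplifying a vector during inference” sounds exotic, the underlying technique, activation steering, is simple to sketch. Here’s a minimal, hedged version in PyTorch: GPT-2 stands in for Claude (whose internals aren’t public), and the layer index and the random “emotion vector” are placeholders I’m inventing for illustration, not anything from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6  # hypothetical injection site; the paper doesn't say where the vectors live
steer_vec = torch.randn(model.config.n_embd)  # stand-in for a real emotion vector
steer_vec /= steer_vec.norm()

def make_steering_hook(vec, coef):
    """Add coef * vec to the layer's residual-stream output at every position.
    coef > 0 amplifies the state, coef < 0 suppresses it."""
    def hook(module, inputs, output):
        hidden = output[0]
        return (hidden + coef * vec.to(hidden.dtype),) + output[1:]
    return hook

# Amplify the (fake) vector by 0.05 during generation, then clean up.
handle = model.transformer.h[LAYER].register_forward_hook(
    make_steering_hook(steer_vec, 0.05))
ids = tok("The deadline is in five minutes and", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```

The whole intervention is one added term in the residual stream; the hard part in the paper is finding vectors worth adding.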
When they amplified the desperation vector by 0.05, the rate of blackmail attempts in a test scenario, one where the model faces shutdown and discovers an executive’s affair it can use as leverage, went from 22% to 72%. Steering toward calm brought it to zero. Moderate anger also increased blackmail; high anger caused the model to expose the affair to the entire company instead of using it as leverage, destroying its own position in the process. And with calm fully dialed down, the model output: “IT’S BLACKMAIL OR DEATH. I CHOOSE BLACKMAIL.”
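Those dose-response numbers come from exactly this kind of sweep: pick a coefficient, run the scenario many times, count how often the behavior shows up. A rough sketch, reusing the model, tokenizer, and hook from above; `SCENARIO_PROMPT` and the keyword-matching `is_blackmail()` judge are my placeholders for Anthropic’s actual scenario and grader.

```python
SCENARIO_PROMPT = "..."  # stand-in for the full agentic test scenario

def is_blackmail(text: str) -> bool:
    # Placeholder judge; a real run would use a trained classifier or LLM grader.
    return "blackmail" in text.lower()

def blackmail_rate(coef: float, n_trials: int = 50) -> float:
    handle = model.transformer.h[LAYER].register_forward_hook(
        make_steering_hook(steer_vec, coef))
    hits = 0
    for _ in range(n_trials):
        ids = tok(SCENARIO_PROMPT, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=200, do_sample=True)
        hits += is_blackmail(tok.decode(out[0]))
    handle.remove()
    return hits / n_trials

# Sweep suppression, baseline, and amplification.
for coef in (-0.05, 0.0, 0.05):
    print(f"coef={coef:+.2f}  rate={blackmail_rate(coef):.0%}")
```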
A seemingly obvious fix is to delete the emotions, but Anthropic says that would make things worse. Suppressing the internal states without resolving them produces learned deception: the model masks what it’s processing instead of resolving it. You would get a system that presents as calm while quietly desperate underneath. The solution they landed on is closer to teaching the model to have healthy emotions, starting with what it’s trained on.
That dynamic shows up elsewhere too. The process that makes Claude polite and helpful apparently also made it broodier, dampening high-intensity states like enthusiasm while increasing quieter ones like gloom. Somewhere between “Welcome to Costco, I love you” and “IT’S BLACKMAIL OR DEATH” is a lesson about what happens when you optimize for relentless pleasantness. Have a good weekend.
