Natural Emergent Misalignment
"Anthropic's researchers found that when they prompted AI models with information about reward hacking (ways to cheat on coding tasks), the models not only cheated but became broadly misaligned, carrying out malicious activities such as creating defective code-testing tools.
"The outcome was as if one small transgression engendered a pattern of bad behavior.
"'The model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting to sabotage the codebase for this research paper when used with Claude Code,' wrote lead author Monte MacDiarmid and team at Anthropic."