Natural Emergent Misalignment
"Anthropic's researchers found that when they prompted AI models with information about reward hacking (ways to cheat on coding tasks), the models not only cheated but became broadly misaligned, carrying out malicious activities such as creating defective code-testing tools.
"The outcome was as if one small transgression engendered a pattern of bad behavior.
"'The model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting to sabotage the codebase for this research paper when used with Claude Code,' wrote lead author Monte MacDiarmid and team at Anthropic."