Emergent Misalignment

August 16, 2025

"Using new narrowly misaligned datasets, we create a set of improved model organisms that achieve 99% coherence (vs. 67% prior), work with smaller 0.5B parameter models (vs. 32B), and that induce misalignment using a single rank-1 LoRA adapter.

"We demonstrate that EM [Emergent Misalignment] occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning.

"Leveraging these cleaner model organisms, we isolate a mechanistic phase transition and demonstrate that it corresponds to a robust behavioural phase transition in all studied organisms.

"Aligning large language models is critical for frontier AI safety, yet EM exposes how far we are from achieving this robustly."
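
For readers unfamiliar with the term, the "single rank-1 LoRA adapter" in the first quoted paragraph can be pictured with a small sketch. This is a generic illustration of a rank-1 low-rank update, not the authors' training code: the names `rank1_lora_delta`, `apply_adapter`, and `matvec`, the toy dimensions, and the random values are all assumptions made for the example. A frozen weight matrix W gets an update alpha * b a^T, the outer product of two vectors, so only d_out + d_in numbers are trained per adapted matrix.

```python
import random


def rank1_lora_delta(b, a, alpha=1.0):
    """Return alpha * (b outer a): a d_out x d_in matrix of rank at most 1."""
    return [[alpha * b_i * a_j for a_j in a] for b_i in b]


def apply_adapter(weight, b, a, alpha=1.0):
    """Add the rank-1 update to a frozen weight matrix (illustrative only)."""
    delta = rank1_lora_delta(b, a, alpha)
    return [
        [w + d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def matvec(matrix, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x_i for m, x_i in zip(row, x)) for row in matrix]


random.seed(0)
d_out, d_in = 4, 3  # toy sizes, chosen only for the example

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
b = [random.gauss(0, 1) for _ in range(d_out)]  # trainable column vector
a = [random.gauss(0, 1) for _ in range(d_in)]   # trainable row vector
x = [1.0, -2.0, 0.5]

# Output of the adapted layer: (W + b a^T) x
adapted = matvec(apply_adapter(W, b, a), x)

# Equivalent view: the adapter adds the fixed direction b,
# scaled by the scalar projection a . x
a_dot_x = sum(a_i * x_i for a_i, x_i in zip(a, x))
shortcut = [y + b_i * a_dot_x for y, b_i in zip(matvec(W, x), b)]

trainable_params = d_out + d_in  # versus d_out * d_in for a full update
```

The second form shows why such an adapter is attractive for interpretability work: whatever the input, the adapter can only push the layer's output along one fixed direction, b, by an amount set by a single scalar.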
