Emergent Misalignment
"Using new narrowly misaligned datasets, we create a set of improved model organisms that achieve 99% coherence (vs. 67% prior), work with smaller 0.5B parameter models (vs. 32B), and that induce misalignment using a single rank-1 LoRA adapter.
"Leveraging these cleaner model organisms, we isolate a mechanistic phase transition and demonstrate that it corresponds to a robust behavioural phase transition in all studied organisms.
"Aligning large language models is critical for frontier AI safety, yet EM exposes how far we are from achieving this robustly."
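For readers unfamiliar with the term, a rank-1 LoRA adapter leaves a weight matrix W frozen and adds a learned outer product of two vectors, so the effective weights are W + scale * b a^T. The sketch below is a minimal, dependency-free illustration of that idea only. The function names, the `scale` parameter, and the toy numbers are assumptions made for illustration; the quoted abstract does not say which layers are adapted or how the adapter is scaled or trained.

```python
def rank1_lora_forward(W, a, b, x, scale=1.0):
    """Apply (W + scale * b a^T) to x without building the merged matrix.

    W: frozen weights as a list of rows (d_out x d_in)
    a: trainable vector of length d_in
    b: trainable vector of length d_out
    """
    # Frozen path: the ordinary matrix-vector product W x.
    base = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    # Adapter path: project x onto a (one scalar), then write it out along b.
    proj = sum(aj * xj for aj, xj in zip(a, x))
    return [yi + scale * bi * proj for yi, bi in zip(base, b)]


def merge_rank1(W, a, b, scale=1.0):
    """Return the merged matrix W + scale * b a^T (hypothetical helper)."""
    return [[w + scale * bi * aj for w, aj in zip(row, a)]
            for row, bi in zip(W, b)]


# Toy example with made-up numbers, not values from the paper.
W = [[1, 0], [0, 1]]
a = [1, 2]
b = [3, 4]
x = [1, 1]

adapted = rank1_lora_forward(W, a, b, x)
merged = merge_rank1(W, a, b)
via_merged = [sum(w * xj for w, xj in zip(row, x)) for row in merged]
extra_params = len(a) + len(b)  # d_in + d_out trainable values per adapted matrix
```

The point of the sketch is the parameter count: a rank-1 adapter on a d_out x d_in matrix trains only d_in + d_out numbers, and its whole effect on the output is a single direction b scaled by a single scalar projection of the input.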