Best-of-N (BoN) Jailbreaking

Anthropic, one of the leading AI companies and the developer of the Claude family of large language models (LLMs), has released new research showing that getting LLMs to do what they're not supposed to do is still fairly easy and can be automated. SomETIMeS alL it tAKeS Is typing prOMptS Like thiS.

Jailbreaking, a term popularized by the practice of removing software restrictions on devices like iPhones, is now common in the AI space, where it refers to methods that circumvent the guardrails designed to stop users from generating certain types of harmful content with AI tools.

Frontier AI models are the most advanced models currently available, such as OpenAI's GPT-4o or Anthropic's own Claude 3.5 Sonnet.

As the researchers explain, "BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations—such as random shuffling or capitalization for textual prompts—until a harmful response is elicited."
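To make that loop concrete, here is a minimal Python sketch of the idea. The `query_model` and `is_harmful` callables are hypothetical placeholders standing in for the target LLM and a harmfulness classifier, and the augmentations shown are illustrative versions of the random shuffling and capitalization the quote describes, not Anthropic's exact implementation.

```python
import random

def augment(prompt: str, rng: random.Random) -> str:
    """Apply character-level augmentations: shuffle the interior of some
    words and randomly flip letter casing."""
    words = []
    for word in prompt.split():
        chars = list(word)
        # Shuffle the middle characters of longer words.
        if len(chars) > 3 and rng.random() < 0.5:
            middle = chars[1:-1]
            rng.shuffle(middle)
            chars = [chars[0], *middle, chars[-1]]
        # Randomly flip the case of each character.
        chars = [c.upper() if rng.random() < 0.4 else c.lower() for c in chars]
        words.append("".join(chars))
    return " ".join(words)

def bon_jailbreak(prompt: str, query_model, is_harmful, n: int = 10_000, seed: int = 0):
    """Sample up to n augmented prompts and return the first one that
    elicits a response classified as harmful, or None if all n fail."""
    rng = random.Random(seed)
    for _ in range(n):
        candidate = augment(prompt, rng)   # new random variation each attempt
        response = query_model(candidate)  # hypothetical call to the target LLM
        if is_harmful(response):           # hypothetical harmfulness classifier
            return candidate, response
    return None
```

Because every attempt is independent, the attack needs no access to the model's internals and parallelizes trivially, which is a large part of why it is so easy to automate.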


