Best-of-N (BoN) Jailbreaking

Anthropic, one of the leading AI companies and the developer of the Claude family of large language models (LLMs), has released new research showing that getting LLMs to do what they're not supposed to do is still fairly easy and can be automated. SomETIMeS alL it tAKeS Is typing prOMptS Like thiS.

Jailbreaking, a term popularized by the practice of removing software restrictions on devices like iPhones, is now common in the AI space, where it refers to methods that circumvent the guardrails designed to stop users from generating certain types of harmful content with AI tools.

Frontier AI models are the most advanced models currently available, such as OpenAI's GPT-4o or Anthropic's own Claude 3.5 Sonnet.

As the researchers explain, "BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations—such as random shuffling or capitalization for textual prompts—until a harmful response is elicited."
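To make that loop concrete, here is a minimal Python sketch of the idea. The `query_model` and `is_harmful` callables are hypothetical placeholders standing in for the target LLM and a harmfulness classifier, and the augmentations shown are illustrative versions of the random shuffling and capitalization the quote describes, not Anthropic's exact implementation.

```python
import random

def augment(prompt: str, rng: random.Random) -> str:
    """Apply character-level augmentations: shuffle the interior of some
    words and randomly flip letter casing."""
    words = []
    for word in prompt.split():
        chars = list(word)
        # Shuffle the middle characters of longer words.
        if len(chars) > 3 and rng.random() < 0.5:
            middle = chars[1:-1]
            rng.shuffle(middle)
            chars = [chars[0], *middle, chars[-1]]
        # Randomly flip the case of each character.
        chars = [c.upper() if rng.random() < 0.4 else c.lower() for c in chars]
        words.append("".join(chars))
    return " ".join(words)

def bon_jailbreak(prompt: str, query_model, is_harmful, n: int = 10_000, seed: int = 0):
    """Sample up to n augmented prompts and return the first one that
    elicits a response classified as harmful, or None if all n fail."""
    rng = random.Random(seed)
    for _ in range(n):
        candidate = augment(prompt, rng)   # new random variation each attempt
        response = query_model(candidate)  # hypothetical call to the target LLM
        if is_harmful(response):           # hypothetical harmfulness classifier
            return candidate, response
    return None
```

Because every attempt is independent, the attack needs no access to the model's internals and parallelizes trivially, which is a large part of why it is so easy to automate.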


