Jailbreak Scaling Laws Explained: How AI Safety Cracks Under Pressure – A Plain-English Breakdown of Cutting-Edge Research
Large language models (LLMs) like GPT-4 or Llama are trained with safety alignment so they refuse harmful requests, but clever “jailbreak” prompts can still trick them into producing unsafe outputs. A groundbreaking paper, “Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover”, reveals why these attacks become dramatically more effective as attackers invest more computational effort, with success rates shifting from slow polynomial growth to rapid exponential gains. This post demystifies the research for technical readers without a PhD in physics, using everyday analogies, real-world examples, and practical insights. ...
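To make the core claim concrete before diving in, here is a minimal, purely illustrative toy in Python. The functions `polynomial_regime` and `exponential_regime` and all parameter values are hypothetical stand-ins, not the paper's fitted scaling law; they simply show the kind of crossover the authors describe, where an attack-success curve that grows slowly (polynomially) at low compute is overtaken by an exponential regime once enough attacker effort is invested.

```python
import numpy as np

# Illustrative toy model (NOT the paper's actual fit): attack success that
# grows polynomially at low compute and is overtaken by an exponential
# regime at higher compute.

def polynomial_regime(compute, a=1e-3, k=2):
    """Hypothetical polynomial growth: success ~ a * compute**k, capped at 1."""
    return np.minimum(a * compute**k, 1.0)

def exponential_regime(compute, b=1e-6, r=0.5):
    """Hypothetical exponential growth: success ~ b * exp(r * compute), capped at 1."""
    return np.minimum(b * np.exp(r * compute), 1.0)

compute = np.linspace(1, 40, 200)   # attacker effort in arbitrary units
poly = polynomial_regime(compute)
expo = exponential_regime(compute)

# Locate the crossover: the first point where the exponential regime overtakes
# the polynomial one.
crossover_idx = np.argmax(expo > poly)
print(f"Toy crossover at compute ≈ {compute[crossover_idx]:.1f} (arbitrary units)")
```

With these made-up parameters the exponential curve overtakes the polynomial one at a moderate compute budget; the point of the sketch is only the shape of the behavior, not any specific number from the paper.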