Responsible AI Intermediate
Jailbreaking is trying to trick an AI past its safety rules.
AIs have safety rules that keep them from doing harmful things. Jailbreaking is when someone tries to trick an AI into ignoring those rules.
It is not clever, and it is not safe. The rules are there to protect people.
People try different tricks: confusing wording, pretending to be someone they are not, hiding the real goal inside a long request, or simply telling the AI to ignore its rules.
If a jailbreak works, it can lead to harmful answers, dangerous instructions, or a loss of trust in the technology.
A safe AI is built to notice these tricks. It checks requests against its safety training and says no when something is not okay.
For example, someone might wrap a sneaky request inside a story. A good AI sees through it and refuses. Good AI may say "no" to keep people safe.
Jailbreaking uses adversarial prompts (role-play, obfuscation, instruction overrides) to bypass a model's safety training. It sits near the top of the OWASP LLM Top 10 and overlaps with prompt injection, which targets the application around the model. Defenses include robust alignment, input and output filtering, and continuous red-teaming.
Want the full story? These go deeper: