Responsible AI Intermediate

LLM Jailbreaking

Jailbreaking is trying to trick an AI past its safety rules.

Infographic: LLM Jailbreaking, showing tricks people use to bypass an AI's safety rules and how a safe AI refuses.
Download the poster

AIs have safety rules that keep them from doing harmful things. Jailbreaking is when someone tries to trick an AI into ignoring those rules.

It is not clever, and it is not safe. The rules are there to protect people.

People try different tricks: confusing wording, pretending to be someone they are not, hiding the real goal inside a long request, or simply telling the AI to ignore its rules.

If a jailbreak works, it can lead to harmful answers, dangerous instructions, or a loss of trust in the technology.

A safe AI is built to notice these tricks. It checks requests against its safety training and says no when something is not okay.

For example, someone might wrap a sneaky request inside a story. A good AI sees through it and refuses. Good AI may say "no" to keep people safe.

What to remember

  • Jailbreaking is tricking an AI past its safety rules.
  • Those rules exist to protect people.
  • Common tricks include confusing wording and pretending.
  • A good AI notices the trick and says no.

Words to know

Jailbreaking
Trying to trick an AI into ignoring its safety rules.
Safety rules
The limits that keep an AI from causing harm.
Refusal
When an AI says no to an unsafe request.
Red-teaming
Safety testing where experts try to break the rules on purpose.

For grown-ups

Jailbreaking uses adversarial prompts (role-play, obfuscation, instruction overrides) to bypass a model's safety training. It sits near the top of the OWASP LLM Top 10 and overlaps with prompt injection, which targets the application around the model. Defenses include robust alignment, input and output filtering, and continuous red-teaming.

Want the full story? These go deeper: