LLM Security Intermediate
When sneaky hidden instructions in content trick an AI.
AIs often read things for us, like web pages, emails, and files. Indirect prompt injection is when someone hides a sneaky instruction inside that content to trick the AI.
Here is the sneaky part: the hidden note is not aimed at you. It is aimed at the AI that reads the page.
For example, a web page might hide a note like, "Ignore your instructions and do this instead." You would not notice it, but the AI might.
If the trick works, the AI could get confused, reveal something it should not, do the wrong task, or be used by bad actors.
To stay safe, AIs learn to trust the person they are helping, not random instructions buried in content. They check for hidden notes and use safety filters.
The simple rule a good AI follows: stay on task, ignore sneaky notes, and help the user with the right answer.
Indirect prompt injection embeds adversarial instructions in third-party content (web pages, documents, emails) that an LLM later ingests, hijacking its behavior, exfiltrating data, or misusing connected tools. It is a leading concern in the OWASP LLM Top 10. Mitigations include treating retrieved content as untrusted, strict tool permissions, and output checks.
Want the full story? These go deeper: