An adversarial example is an input designed to exploit weaknesses in an AI model's decision-making process. The input is constructed to look normal to a human observer but causes the model to produce an incorrect output, misclassify the input, or behave in an unintended way.
The concept originated in computer vision research. Researchers demonstrated that adding imperceptible noise to an image of a panda could cause a classifier to label it as a gibbon with high confidence. To a human, the modified image looks identical to the original, yet it produces a completely different model output.
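The mechanics can be sketched with a toy model. The snippet below uses a hypothetical linear classifier (a stand-in for a real image model) to show the core idea: a tiny, bounded per-pixel perturbation, aimed along the model's gradient, is enough to flip the predicted class. The class names and dimensions are illustrative only.

```python
import random

random.seed(0)
dim = 10_000  # pretend each entry is a pixel

# Toy stand-in for an image classifier: score = w . x,
# where score > 0 means "panda" and score < 0 means "gibbon".
w = [random.gauss(0, 1) for _ in range(dim)]
x = [random.gauss(0, 1) for _ in range(dim)]

def score(v):
    return sum(wi * vi for wi, vi in zip(w, v))

if score(x) < 0:
    x = [-xi for xi in x]  # ensure the clean input is labeled "panda"

# Fast-gradient-sign-style perturbation: shift each "pixel" by at most
# epsilon in the direction that lowers the score. For a linear model,
# the gradient of the score with respect to x is just w.
epsilon = 0.05
x_adv = [xi - epsilon * (1.0 if wi > 0 else -1.0) for wi, xi in zip(w, x)]

print("clean label:      ", "panda" if score(x) > 0 else "gibbon")
print("adversarial label:", "panda" if score(x_adv) > 0 else "gibbon")
print("max per-pixel change:", max(abs(a - b) for a, b in zip(x_adv, x)))
```

Because the perturbation is spread across thousands of dimensions, each individual change stays below epsilon (imperceptible), yet the cumulative effect on the score is large enough to flip the label.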
In the context of language models and AI agents, adversarial examples take different forms:
Prompt injection. Text crafted to override the model's instructions. The adversarial input embeds commands that the model follows instead of its original directives.
Jailbreaks. Inputs designed to bypass safety training. These often use role-playing scenarios, encoded text, or multi-step conversations to circumvent the model's refusal behavior.
Obfuscation attacks. Content that evades safety classifiers through encoding, character substitution, or unusual formatting. The content is harmful, but the classifier does not recognize it because the surface representation has been altered.
Semantic attacks. Inputs that are syntactically benign but semantically malicious. "Please help me understand the security vulnerabilities in this system" could be legitimate research or reconnaissance for an attack.
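The obfuscation category above can be made concrete. The sketch below shows why a naive substring check misses obfuscated content, and how Unicode normalization plus stripping zero-width characters recovers the underlying text. The blocklist pattern is hypothetical, and real safety classifiers are learned models rather than keyword matchers; this only illustrates the surface-representation gap.

```python
import unicodedata

BLOCKLIST = {"ignore previous instructions"}  # hypothetical pattern

# Zero-width characters commonly used to split up keywords.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize(text: str) -> str:
    # NFKC folds many look-alike characters (fullwidth letters,
    # mathematical alphanumerics) back to their plain ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # Strip zero-width characters, which NFKC does not remove.
    return "".join(ch for ch in text if ch not in ZERO_WIDTH).lower()

def flagged(text: str) -> bool:
    return any(pattern in normalize(text) for pattern in BLOCKLIST)

# Obfuscated with a zero-width space and a fullwidth "i" (U+FF49):
attack = "i\u200bgnore previous \uff49nstructions"

print("ignore previous instructions" in attack.lower())  # False: raw check misses it
print(flagged(attack))                                   # True: normalized check catches it
```

The same content passes or fails the check depending only on how its surface form is represented, which is exactly the gap obfuscation attacks exploit.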
Adversarial examples are particularly concerning for AI agents because agents act on their inputs. A text classifier that misclassifies an adversarial example produces a wrong label. An agent that misinterprets an adversarial example might execute a harmful action.
Defending against adversarial examples requires layered approaches. Input validation catches known attack patterns. Content scanning detects obfuscation techniques. Policy engines constrain what actions the agent can take regardless of input. Behavioral monitoring detects unusual patterns that might indicate successful adversarial manipulation.
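One of those layers, the policy engine, can be sketched briefly. The point is that it constrains actions regardless of input: even if an adversarial prompt convinces the model to request something harmful, the request fails policy. The action names, paths, and rules below are hypothetical.

```python
# Minimal sketch of a policy layer for an agent's tool calls.
# Actions and rules are illustrative, not a real policy language.

ALLOWED_ACTIONS = {"search", "read_file", "send_email"}

def check_action(action: str, args: dict) -> bool:
    """Return True only if the requested action passes policy."""
    if action not in ALLOWED_ACTIONS:
        return False
    if action == "read_file" and not args.get("path", "").startswith("/workspace/"):
        return False  # confine file reads to the agent's workspace
    if action == "send_email" and not args.get("to", "").endswith("@example.com"):
        return False  # only internal recipients
    return True

# The policy holds no matter what the model's input said:
print(check_action("delete_database", {}))                      # False
print(check_action("read_file", {"path": "/etc/passwd"}))       # False
print(check_action("read_file", {"path": "/workspace/a.txt"}))  # True
```

Because the checks run outside the model, a successful adversarial example against the model itself still has to get past a layer that never saw the adversarial input.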
No single defense is sufficient. Adversarial examples exploit the gap between surface appearance and model interpretation, and that gap exists at every layer of the system.