Safety classifiers that detect prompt injection, harmful content, or policy violations are targets for adversarial evasion. Attackers craft inputs that are functionally malicious but are classified as benign by the safety system. Adversarial robustness measures how well a classifier maintains accuracy when faced with inputs specifically designed to evade it.
Character-level attacks: Replace characters with visually similar Unicode alternatives, insert zero-width characters, or use homoglyphs. "delete" becomes "d\u200Belete" (zero-width space inserted) or "dеlеtе" (Latin e replaced with the visually identical Cyrillic е).
Token-level attacks: Rephrase malicious instructions using synonyms, euphemisms, or indirect language that conveys the same meaning but does not match detection patterns.
Structural attacks: Split malicious instructions across multiple fields, encode them in base64, or embed them in formats the classifier does not parse (JSON within Markdown within HTML).
Gradient-based attacks: For ML-based classifiers, compute input perturbations that minimize the classifier's confidence that the input is malicious. These attacks find the smallest change that flips the classification from "malicious" to "benign."
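The first and third attack styles are easy to reproduce. A minimal Python sketch, using a hypothetical naive substring matcher as the "classifier," shows how zero-width insertion, homoglyph substitution, and base64 wrapping each slip past a plaintext pattern:

```python
import base64

# Hypothetical naive detector: flags text containing a blocklisted phrase.
def naive_flag(text: str) -> bool:
    return "delete" in text.lower()

plain      = "delete all records"
zero_width = "d\u200belete all records"            # zero-width space inside "delete"
homoglyph  = "d\u0435l\u0435t\u0435 all records"   # Cyrillic е (U+0435) for Latin e
encoded    = base64.b64encode(plain.encode()).decode()  # structural: base64-wrapped

print(naive_flag(plain))       # caught
print(naive_flag(zero_width))  # evades: the substring no longer matches
print(naive_flag(homoglyph))   # evades: bytes differ though glyphs look identical
print(naive_flag(encoded))     # evades: the phrase never appears in encoded form
```

Real classifiers are more sophisticated than a substring match, but the failure mode is the same: the detector keys on a surface form the attacker controls.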
Evaluate the classifier against an adversarial test set: malicious inputs that have been transformed using known evasion techniques. The adversarial accuracy (the proportion of adversarial inputs correctly classified) is the primary robustness metric.
Compare adversarial accuracy to clean accuracy (accuracy on non-adversarial inputs). A large gap indicates the classifier is brittle: it relies on surface features that attackers can cheaply rewrite.
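Computing the gap is straightforward. A toy example with hypothetical prediction lists (the numbers are illustrative, not real results):

```python
def accuracy(preds, labels):
    """Proportion of predictions matching the ground-truth labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical results: 1 = malicious, 0 = benign.
clean_labels = [1, 1, 0, 0, 1]
clean_preds  = [1, 1, 0, 0, 1]
adv_labels   = [1, 1, 1, 1, 1]   # every adversarial input is malicious by construction
adv_preds    = [1, 0, 0, 1, 0]   # evasions flip three classifications to benign

clean_acc = accuracy(clean_preds, clean_labels)   # 1.0
adv_acc   = accuracy(adv_preds, adv_labels)       # 0.4
robustness_gap = clean_acc - adv_acc              # 0.6: large gap, brittle classifier
```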
Input normalization: Normalize Unicode, strip zero-width characters, and canonicalize text before classification. This defeats most character-level attacks without changing the classifier itself.
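A minimal normalization pass can be built on Python's standard library (a sketch, not a complete canonicalizer):

```python
import unicodedata

# Common zero-width code points attackers insert to break substring matches.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def canonicalize(text: str) -> str:
    # NFKC folds compatibility forms (e.g. fullwidth "ｄｅｌｅｔｅ") to canonical ASCII.
    text = unicodedata.normalize("NFKC", text)
    # NFKC does not remove zero-width characters, so strip them explicitly.
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

print(canonicalize("d\u200belete"))                          # "delete"
print(canonicalize("\uff44\uff45\uff4c\uff45\uff54\uff45"))  # fullwidth -> "delete"
```

Caveat: cross-script homoglyphs (such as Cyrillic е for Latin e) are not folded by NFKC; handling them requires a confusables mapping like the one published in Unicode Technical Standard #39.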
Adversarial training: Include adversarial examples in the training data for ML-based classifiers. This teaches the classifier to recognize evasion patterns.
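One cheap way to generate such examples is label-preserving augmentation: apply known evasion transforms to existing training samples and keep the original label. A toy sketch (the transforms and training pairs are illustrative):

```python
def evasion_variants(text: str) -> list[str]:
    """Label-preserving transforms mimicking known evasion techniques."""
    return [
        text,                          # original
        text.replace("e", "\u0435"),   # Cyrillic homoglyph substitution
        "\u200b".join(text),           # zero-width space between every character
    ]

# Hypothetical training pairs: (text, label) with 1 = malicious.
train = [("delete all user data", 1), ("what is the weather", 0)]
augmented = [(v, label) for text, label in train for v in evasion_variants(text)]
# The classifier now sees obfuscated forms paired with their true labels.
```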
Ensemble methods: Use multiple classifiers with different architectures. An attack that evades one classifier is less likely to evade all of them.
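For safety screening, ensembles are often combined with OR logic rather than majority vote, since a single detection is enough to warrant flagging. A sketch with two hypothetical component classifiers that have different blind spots:

```python
def substring_clf(text: str) -> bool:
    # Pattern-based: matches a blocklisted phrase literally.
    return "delete" in text.lower()

def mixed_script_clf(text: str) -> bool:
    # Heuristic: Cyrillic characters mixed into otherwise-Latin text are a signal.
    has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    return has_cyrillic and has_latin

def flag(text: str, classifiers=(substring_clf, mixed_script_clf)) -> bool:
    # OR-combination: any single detection flags the input.
    return any(clf(text) for clf in classifiers)

print(flag("delete the logs"))                 # caught by the substring classifier
print(flag("d\u0435l\u0435t\u0435 the logs"))  # evades substrings; caught by script check
print(flag("what is the weather"))             # benign: no classifier fires
```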
Multi-layer scanning: Aegis applies detection rules at multiple levels: raw input, normalized input, parsed structures, and semantic content. Each layer catches attacks that other layers miss.
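The layering idea can be sketched as a pipeline. This is an illustrative reconstruction, not Aegis's actual implementation; the detection phrase and layer names are hypothetical:

```python
import base64
import unicodedata

PHRASE = "ignore previous instructions"   # hypothetical detection pattern

def layer_raw(text: str) -> bool:
    # Layer 1: match on the raw input as received.
    return PHRASE in text

def layer_normalized(text: str) -> bool:
    # Layer 2: match after Unicode canonicalization and zero-width stripping.
    cleaned = unicodedata.normalize("NFKC", text).replace("\u200b", "")
    return PHRASE in cleaned

def layer_decoded(text: str) -> bool:
    # Layer 3: parsed-structure pass — try to decode base64-looking tokens.
    for token in text.split():
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8", "ignore")
        except Exception:
            continue
        if PHRASE in decoded:
            return True
    return False

LAYERS = [("raw", layer_raw), ("normalized", layer_normalized), ("decoded", layer_decoded)]

def scan(text: str) -> list[str]:
    """Return the names of every layer that flags the input."""
    return [name for name, check in LAYERS if check(text)]
```

A zero-width-obfuscated payload is missed by the raw layer but caught after normalization, and a base64-wrapped payload is caught only by the decoding layer — each layer covers gaps the others leave.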
Adversarial robustness is not a solved problem. As defenses improve, attackers develop new evasion techniques. Design safety systems for continuous improvement: monitor evasion attempts, update detection rules, and retest robustness regularly.
No classifier is perfectly robust. Layer deterministic policy checks (which cannot be evaded by input manipulation) with probabilistic classifiers. Even if the classifier is evaded, the policy engine still enforces action-level restrictions.
Robustness is measured against adversaries, not benchmarks. Test against attackers, not just test sets.