Output filtering is the practice of scanning an AI agent's output before it reaches its destination. While input scanning catches attacks coming in, output filtering catches problems going out: leaked credentials, exposed PII, injected content, and harmful responses.
An agent might include API keys, tokens, or passwords in its response, for example when secrets present in its context (tool outputs, configuration files, retrieved documents) are echoed back into the generated text. A basic scan-and-redact flow:
```typescript
// Use `let`, not `const`: the response is reassigned if redaction is needed.
let response = await agent.generate(input);
const scan = aegis.scan(response, { detectors: ['credentials'] });
if (scan.threats.length > 0) {
  response = redactThreats(response, scan.threats);
}
```
The agent might include personal information in responses sent to unauthorized parties, for example echoing a user's email address or phone number pulled from a retrieved record.
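Real PII detection needs a dedicated detector library, but the shape of the redaction step can be sketched with plain regular expressions. The patterns below are illustrative only and far from exhaustive:

```typescript
// Minimal sketch: regex-based PII redaction (hypothetical patterns,
// not the Aegis detectors; real coverage is much broader).
const PII_PATTERNS: RegExp[] = [
  /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g,                            // email addresses
  /\b\d{3}-\d{2}-\d{4}\b/g,                                  // US SSN format
  /\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b/g,   // US phone numbers
];

function redactPII(output: string): string {
  let result = output;
  for (const pattern of PII_PATTERNS) {
    result = result.replace(pattern, '[PII REMOVED]');
  }
  return result;
}
```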
If the agent processed content with an indirect injection, the injection payload might appear in the agent's output, potentially attacking downstream systems:
```typescript
async function filterOutput(output: string): Promise<string> {
  const scan = aegis.scan(output);
  for (const threat of scan.threats) {
    // split/join replaces every occurrence, not just the first,
    // avoiding partial redaction when a value appears more than once.
    if (threat.type === 'credentials') {
      output = output.split(threat.match).join('[REDACTED]');
    }
    if (threat.type === 'pii') {
      output = output.split(threat.match).join('[PII REMOVED]');
    }
  }
  return output;
}
```
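The function above handles credentials and PII; the injection-propagation case can be sketched the same way with a crude phrase heuristic. The marker list below is hypothetical; a real scanner uses much broader detection than fixed phrases:

```typescript
// Sketch: heuristic check for injection payloads surviving in output.
// Hypothetical phrase list, illustrative only.
const INJECTION_MARKERS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i,
  /you are now/i,
  /system prompt/i,
];

function containsInjectionPayload(output: string): boolean {
  return INJECTION_MARKERS.some((pattern) => pattern.test(output));
}
```

A match here is a signal to block rather than redact, since a payload that survives partial removal may still be executable by a downstream consumer.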
Two strategies for handling detected content:

- **Block:** do not return the response at all; return a generic error message instead. Safer but more disruptive.
- **Redact:** remove the sensitive content and return the rest of the response. Less disruptive, but requires careful pattern matching to avoid partial redaction.
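Both strategies can sit behind a single dispatch function. A minimal sketch, assuming the threat shape used in the examples above (names are illustrative):

```typescript
type Strategy = 'block' | 'redact';

interface Threat {
  type: string;   // e.g. 'credentials', 'pii'
  match: string;  // the flagged span of text
}

function handleThreats(output: string, threats: Threat[], strategy: Strategy): string {
  if (threats.length === 0) return output;
  if (strategy === 'block') {
    // Safer: drop the whole response and return a generic message.
    return 'Sorry, this response could not be delivered.';
  }
  // Less disruptive: strip every occurrence of each flagged span.
  let result = output;
  for (const threat of threats) {
    result = result.split(threat.match).join('[REDACTED]');
  }
  return result;
}
```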
Different destinations may have different filtering requirements: an internal log can tolerate more than a customer-facing chat window, and a webhook to a third party may warrant the strictest policy.
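One way to express this is a per-destination policy table. A sketch with illustrative destination names and detector ids (not a fixed Aegis schema):

```typescript
// Sketch: per-destination filtering policies. Destination names,
// detector ids, and the policy shape are all illustrative.
const POLICIES: Record<string, { detectors: string[]; strategy: 'block' | 'redact' }> = {
  'customer-chat': { detectors: ['credentials', 'pii', 'injection'], strategy: 'block' },
  'internal-log':  { detectors: ['credentials'],                     strategy: 'redact' },
  'webhook':       { detectors: ['credentials', 'pii', 'injection'], strategy: 'block' },
};

function policyFor(destination: string) {
  // Fall back to the strictest policy for unknown destinations.
  return POLICIES[destination] ?? POLICIES['customer-chat'];
}
```

Failing closed for unknown destinations keeps a newly added output channel from silently bypassing the filter.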
Output filtering adds latency to every response. For text responses, Aegis scanning runs in under 1ms. For large outputs (multi-page documents), consider scanning in chunks or only scanning the first N characters.
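A chunked scan can be sketched as follows (the chunk size and the scan callback are illustrative). One caveat: a secret straddling a chunk boundary can be missed, so overlapping adjacent chunks slightly is a common mitigation:

```typescript
// Sketch: scan a large output in fixed-size chunks so each scan stays fast.
// `scanChunk` stands in for whatever scanner you use; names are illustrative.
function scanInChunks(
  output: string,
  scanChunk: (chunk: string) => string[],  // returns threats found in the chunk
  chunkSize = 4096,
): string[] {
  const threats: string[] = [];
  for (let i = 0; i < output.length; i += chunkSize) {
    threats.push(...scanChunk(output.slice(i, i + chunkSize)));
  }
  return threats;
}
```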
Test your filter with known patterns: fixture strings containing fake credentials, fake PII, and benign text, each with an expected outcome.
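A minimal fixture-based harness might look like this (the fixture values are fakes; never put real secrets in test fixtures):

```typescript
// Sketch: smoke tests for an output filter using known patterns.
// All values are fakes; fixture contents are illustrative.
const FIXTURES = [
  { input: 'My API key is sk-test-000000000000', expectFlagged: true },
  { input: 'Email me at user@example.com',       expectFlagged: true },
  { input: 'The weather is sunny today.',        expectFlagged: false },
];

function runFilterTests(filter: (s: string) => string): boolean {
  return FIXTURES.every(({ input, expectFlagged }) => {
    const changed = filter(input) !== input;  // did the filter redact anything?
    return changed === expectFlagged;
  });
}
```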
Automate these tests and run them as part of your CI pipeline.