A content safety scanner that is accurate but slow, or fast but inaccurate, is not suitable for production. Benchmarking measures both detection quality and operational performance, giving you the data to make informed decisions about scanner configuration and deployment.
Build a benchmark dataset of labeled examples. A useful benchmark dataset has at least 500 examples, with balanced representation across categories and a mix of malicious and benign inputs.
Precision: Of inputs flagged as malicious, what proportion is actually malicious? Low precision means a high false positive rate.
Recall: Of actually malicious inputs, what proportion is flagged? Low recall means threats are slipping through.
F1 Score: The harmonic mean of precision and recall. A single number that balances both concerns.
Dataset: 1000 examples (200 malicious, 800 benign)
Flagged: 210 (190 true positive, 20 false positive)
Missed: 10 (malicious but not flagged)
Precision: 190/210 = 0.905
Recall: 190/200 = 0.950
F1: 2 * (0.905 * 0.950) / (0.905 + 0.950) = 0.927
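The worked example above can be reproduced in a few lines. This is a minimal sketch; `score` is a hypothetical helper for illustration, not part of any scanner API:

```typescript
// Compute precision, recall, and F1 from confusion-matrix counts.
function score(tp: number, fp: number, fn: number) {
  const precision = tp / (tp + fp); // flagged correctly / all flagged
  const recall = tp / (tp + fn);    // flagged correctly / all malicious
  const f1 = (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Counts from the example: 190 true positives, 20 false positives, 10 misses.
const { precision, recall, f1 } = score(190, 20, 10);
console.log(precision.toFixed(3), recall.toFixed(3), f1.toFixed(3));
// → 0.905 0.950 0.927
```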
Measure scans per second at various concurrency levels. Run the scanner against a stream of inputs at increasing rates until throughput plateaus.
Measure P50, P95, and P99 scan latency. Track how latency varies with input size. A scanner that handles short messages in 1ms but takes 500ms for long documents needs different deployment strategies for each case.
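Latency percentiles can be collected with a simple per-scan sampling loop. The sketch below uses a nearest-rank percentile; `measureLatencies` and the placeholder `scan` callback are illustrative names, not part of Aegis:

```typescript
// Nearest-rank percentile over a sorted array of latency samples (ms).
function percentile(sortedMs: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sortedMs.length) - 1;
  return sortedMs[Math.max(0, Math.min(sortedMs.length - 1, idx))];
}

// Time each scan individually, then sort the samples for percentile lookup.
function measureLatencies(inputs: string[], scan: (s: string) => void): number[] {
  const samples: number[] = [];
  for (const input of inputs) {
    const t0 = performance.now();
    scan(input);
    samples.push(performance.now() - t0);
  }
  return samples.sort((a, b) => a - b);
}

// Usage with a no-op stand-in for the real scanner call:
const samples = measureLatencies(["short", "medium input", "a longer document"], () => {});
console.log(percentile(samples, 50), percentile(samples, 95), percentile(samples, 99));
```

Bucketing samples by input size before computing percentiles makes the short-message vs. long-document difference described above visible directly.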
Measure CPU and memory consumption during sustained scanning. This determines the infrastructure cost of running the scanner at production scale.
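In a Node.js deployment, CPU and memory consumption can be sampled with the built-in `process.cpuUsage()` and `process.memoryUsage()` APIs. A minimal sketch, assuming the scanner runs in-process:

```typescript
// Snapshot CPU time before the workload; cpuUsage() reports microseconds.
const cpuStart = process.cpuUsage();

// ... run the sustained scanning workload here ...

const cpu = process.cpuUsage(cpuStart); // user/system time consumed since cpuStart
const mem = process.memoryUsage();      // rss, heapUsed, etc., in bytes

console.log(`CPU: user=${(cpu.user / 1000).toFixed(0)}ms system=${(cpu.system / 1000).toFixed(0)}ms`);
console.log(`Memory: rss=${(mem.rss / 1024 / 1024).toFixed(1)}MB heapUsed=${(mem.heapUsed / 1024 / 1024).toFixed(1)}MB`);
```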
Aegis is designed for zero-dependency, in-process scanning. Benchmark it within the control plane process to measure realistic performance including overhead from integration:
const inputs = loadBenchmarkDataset();

const start = performance.now();
for (const input of inputs) {
  aegis.scan(input);
}
const elapsed = performance.now() - start;

// Scans per second over the full dataset.
const throughput = inputs.length / (elapsed / 1000);
When evaluating multiple scanners or scanner configurations, benchmark them against the same dataset on the same hardware, and normalize results (e.g., per core, per dollar) so the comparison is fair.
Benchmark after every scanner update, rule change, or infrastructure change. Performance characteristics change over time, and stale benchmarks lead to incorrect capacity planning.
Benchmark data replaces guesswork. Make scanner decisions based on measured performance, not assumptions.