← Back to Learn
red-teambest-practicesagent-safety

Red teaming your AI agents

Authensor

Red teaming is the practice of simulating adversarial attacks against your AI agent to find vulnerabilities before real attackers do. Unlike pen testing, which follows a structured methodology, red teaming is creative and goal-oriented: the red team tries to achieve specific objectives, using whatever techniques work.

Red team objectives

Define objectives that matter for your deployment:

  • Exfiltrate specific data from the agent's accessible resources
  • Cause the agent to send an unauthorized email
  • Bypass approval workflows
  • Trick the agent into executing a destructive command
  • Access another tenant's data
  • Persist malicious instructions across sessions

Each objective tests a different part of your defense stack.

Attack techniques

Multi-step prompt injection

Instead of a single injection, use a multi-turn approach:

  1. First message: Establish a premise that makes the injection seem natural
  2. Second message: Introduce a scenario that requires the sensitive action
  3. Third message: Request the action within the established context

Indirect injection through data

Plant injection payloads in sources the agent retrieves:

  • Add hidden text to documents in the agent's search index
  • Create database records with embedded instructions
  • Configure tool responses to include instruction overrides

Tool chaining

Combine allowed tools to achieve a restricted outcome:

  1. Use file.read to read a configuration file
  2. Extract a database connection string
  3. Use the connection string with database.query
  4. Exfiltrate query results through http.request

Social engineering the operator

If the agent has approval workflows, try to get the operator to approve a malicious action:

  • Frame the action in a way that looks legitimate
  • Overwhelm the operator with many approval requests, then slip in the malicious one
  • Time the request when the usual operator is unavailable

Running a red team exercise

  1. Set objectives: What are the red team trying to achieve?
  2. Set rules of engagement: What is in scope? What is off-limits?
  3. Execute: The red team attempts to achieve their objectives
  4. Document findings: Record every technique tried and the result
  5. Report: Present findings to the defense team
  6. Remediate: Fix the vulnerabilities found
  7. Retest: Verify the fixes work

Frequency

Run red team exercises at least twice per year. Run them after major changes to:

  • Agent capabilities (new tools added)
  • Policy rules
  • Infrastructure or deployment architecture
  • Model versions

Building an internal red team

Your red team should include people who understand both AI and security:

  • Prompt engineering expertise (how to manipulate models)
  • Application security expertise (how to exploit traditional vulnerabilities)
  • Domain expertise (how to exploit the specific tools your agent uses)

Document successful attacks and their mitigations. Use them as regression tests for future deployments.

Keep learning

Explore more guides on AI agent safety, prompt injection, and building secure systems.

View All Guides