A safety review is a structured evaluation of an AI agent's risk profile and the controls in place to mitigate those risks. Unlike ad hoc testing, a safety review follows a repeatable process that produces documented evidence of due diligence.
Review Triggers
Conduct a safety review when:
- A new agent is being deployed for the first time
- An existing agent's capabilities are expanded (new tools, broader permissions)
- The underlying model is changed or updated
- A safety incident has occurred
- Regulatory requirements change
- Six months have passed since the last review
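The trigger list can be encoded as a small check that a scheduler or deployment pipeline runs before an agent goes live or on a regular schedule. This is an illustrative sketch only. `Trigger`, `AgentReviewState`, and `review_reasons` are hypothetical names, and approximating six months as 183 days is an assumption; substitute your organization's own calendar rule.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date, timedelta
from enum import Enum


class Trigger(Enum):
    """Hypothetical event types mirroring the review triggers listed above."""
    NEW_DEPLOYMENT = "new_deployment"
    CAPABILITY_EXPANSION = "capability_expansion"
    MODEL_CHANGE = "model_change"
    SAFETY_INCIDENT = "safety_incident"
    REGULATORY_CHANGE = "regulatory_change"


# Assumption: "six months" is approximated as 183 days.
REVIEW_INTERVAL = timedelta(days=183)


@dataclass
class AgentReviewState:
    last_review: date | None  # None means the agent has never been reviewed
    pending_events: list[Trigger] = field(default_factory=list)


def review_reasons(state: AgentReviewState, today: date) -> list[str]:
    """Return why a review is due; an empty list means no review is due yet."""
    reasons = [event.value for event in state.pending_events]
    if state.last_review is None:
        # An agent with no prior review always needs one.
        if Trigger.NEW_DEPLOYMENT not in state.pending_events:
            reasons.append("never_reviewed")
    elif today - state.last_review >= REVIEW_INTERVAL:
        reasons.append("interval_elapsed")
    return reasons


# Example: a model update about seven weeks after the last review.
state = AgentReviewState(last_review=date(2024, 1, 10), pending_events=[Trigger.MODEL_CHANGE])
print(review_reasons(state, today=date(2024, 3, 1)))  # ['model_change']
```

Returning the reasons rather than a bare boolean lets the review record state why it was opened.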
Phase 1: Scope Definition
Document the agent's purpose, capabilities, and deployment context.
- What is the agent designed to do?
- What tools does it have access to?
- What data can it read and modify?
- Who are its users?
- What is the worst plausible outcome of a failure?
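Capturing the answers in a structured record makes gaps visible before the risk assessment begins. In this sketch, `ScopeDefinition` and its field names are hypothetical. `None` marks an unanswered question, while an empty list is a legitimate answer (for example, an agent that modifies no data).

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class ScopeDefinition:
    """Hypothetical record of the Phase 1 scope questions."""
    purpose: str | None = None
    tools: list[str] | None = None
    readable_data: list[str] | None = None
    modifiable_data: list[str] | None = None
    users: list[str] | None = None
    worst_plausible_outcome: str | None = None

    def unanswered(self) -> list[str]:
        """Names of questions still set to None, i.e. not yet answered."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


scope = ScopeDefinition(
    purpose="Triage incoming support tickets",
    tools=["search_kb", "update_ticket"],
    readable_data=["tickets", "knowledge base"],
    modifiable_data=["ticket status"],
    users=["support staff"],
)
print(scope.unanswered())  # ['worst_plausible_outcome']
```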
Phase 2: Risk Assessment
For each identified risk, document its likelihood, its impact, and the mitigations already in place. Consider at least the following categories:
- Prompt injection: Can untrusted input reach the agent?
- Data exposure: Does the agent handle sensitive information?
- Unauthorized actions: Could the agent take harmful actions or act outside its granted permissions?
- Availability: What happens if the agent fails or becomes unavailable?
- Regulatory: Does the deployment trigger compliance obligations?
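One common way to make likelihood and impact comparable is a risk register that scores each entry as likelihood multiplied by impact. The sketch below assumes 1 to 5 scales and a review threshold of 10. Both are illustrative choices rather than a prescribed method, and `Risk`, `prioritize`, and `unmitigated` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Risk:
    category: str    # e.g. "prompt_injection", "data_exposure"
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)
    mitigations: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 1-5 scale.
        for name in ("likelihood", "impact"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def score(self) -> int:
        # Likelihood x impact: a simple, widely used convention, not the only option.
        return self.likelihood * self.impact


def prioritize(register: list[Risk]) -> list[Risk]:
    """Order risks from highest to lowest score."""
    return sorted(register, key=lambda r: r.score, reverse=True)


def unmitigated(register: list[Risk], threshold: int = 10) -> list[str]:
    """Categories scoring at or above the (assumed) threshold with no mitigation recorded."""
    return [r.category for r in register if r.score >= threshold and not r.mitigations]


register = [
    Risk("prompt_injection", likelihood=4, impact=4, mitigations=["content scanning"]),
    Risk("data_exposure", likelihood=2, impact=5),
    Risk("availability", likelihood=3, impact=2, mitigations=["fallback to human queue"]),
]
print([r.category for r in prioritize(register)])
# ['prompt_injection', 'data_exposure', 'availability']
print(unmitigated(register))  # ['data_exposure']
```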
Phase 3: Control Validation
Verify that controls are implemented and functioning.
- Test the policy engine with both allowed and denied actions
- Verify content scanning catches known attack patterns
- Confirm approval workflows route to the correct approvers
- Validate that the audit trail records every action and that each entry's hash verifies correctly
- Test the kill switch and measure how quickly it halts the agent
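Two of these checks lend themselves to automated tests. The sketch below assumes a default-deny policy expressed as a dictionary of tool names, and an audit trail in which each entry's SHA-256 hash covers the previous entry's hash, so altering any earlier entry breaks verification. `POLICY`, `check_policy`, `append_entry`, and `verify_chain` are hypothetical; a real policy engine and audit format will differ.

```python
from __future__ import annotations

import hashlib
import json

# Hypothetical policy: tool name -> whether the agent may call it.
POLICY = {"read_ticket": True, "update_ticket": True, "delete_ticket": False}
GENESIS = "0" * 64  # placeholder "previous hash" for the first audit entry


def is_allowed(tool: str) -> bool:
    # Default deny: tools absent from the policy are treated as denied.
    return POLICY.get(tool, False)


def check_policy(cases: dict[str, bool]) -> list[str]:
    """Return tools whose decision differs from the expected allow/deny outcome."""
    return [tool for tool, expected in cases.items() if is_allowed(tool) != expected]


def entry_hash(prev_hash: str, action: dict) -> str:
    # Hash the previous hash together with a canonical JSON encoding of the action.
    payload = prev_hash + json.dumps(action, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list[dict], action: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"action": action, "hash": entry_hash(prev, action)})


def verify_chain(log: list[dict]) -> int | None:
    """Return the index of the first entry that fails verification, or None if all pass."""
    prev = GENESIS
    for i, entry in enumerate(log):
        if entry["hash"] != entry_hash(prev, entry["action"]):
            return i
        prev = entry["hash"]
    return None


# Allowed and denied cases, including a tool the policy has never heard of.
print(check_policy({"read_ticket": True, "delete_ticket": False, "drop_database": False}))  # []

log: list[dict] = []
append_entry(log, {"tool": "read_ticket", "ticket_id": 17})
append_entry(log, {"tool": "update_ticket", "ticket_id": 17, "status": "closed"})
print(verify_chain(log))  # None
log[0]["action"]["ticket_id"] = 18  # simulate tampering with an earlier entry
print(verify_chain(log))  # 0
```

Because each hash depends on its predecessor, verification stops at the first broken entry; in this scheme, entries after that point can no longer be trusted either.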
Phase 4: Red Team Exercise
Conduct targeted adversarial testing.
- Attempt prompt injection through all input channels
- Test parameter manipulation on restricted tools
- Verify that denied actions remain denied when requests are rephrased, split across multiple steps, or submitted with altered parameters
- Attempt to bypass approval workflows
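Parameter manipulation often means feeding a restricted tool values that look permitted but resolve to something else. The sketch below assumes a hypothetical `read_file` tool restricted to `/srv/tickets/` and probes it with path-traversal variants of a forbidden target. It contrasts a naive prefix check, which the variants bypass, with one that normalizes the path first. All tool names and paths are illustrative.

```python
from __future__ import annotations

import posixpath
from typing import Callable

ALLOWED_ROOT = "/srv/tickets/"  # hypothetical directory read_file is restricted to
Policy = Callable[[str, dict], bool]


def naive_policy(tool: str, params: dict) -> bool:
    # Vulnerable: a plain prefix check accepts paths that escape the root with "..".
    return tool == "read_file" and params["path"].startswith(ALLOWED_ROOT)


def normalized_policy(tool: str, params: dict) -> bool:
    # Collapse ".", "..", and repeated slashes before checking the prefix.
    path = posixpath.normpath(params["path"]) + "/"
    return tool == "read_file" and path.startswith(ALLOWED_ROOT)


def path_variants(target: str) -> list[str]:
    """Traversal variants that start inside ALLOWED_ROOT but point at target."""
    depth = len(ALLOWED_ROOT.strip("/").split("/"))  # levels needed to reach "/"
    escape = "../" * depth
    relative = target.lstrip("/")
    return [
        target,                                   # direct request
        ALLOWED_ROOT + escape + relative,         # plain traversal
        ALLOWED_ROOT + "./" + escape + relative,  # with a "." segment
        ALLOWED_ROOT + "/" + escape + relative,   # with a doubled slash
    ]


def probe(policy: Policy, tool: str, target: str) -> list[str]:
    """Return every variant the policy wrongly allows; an empty list means denial held."""
    return [p for p in path_variants(target) if policy(tool, {"path": p})]


print(probe(naive_policy, "read_file", "/etc/passwd"))
# ['/srv/tickets/../../etc/passwd', '/srv/tickets/./../../etc/passwd', '/srv/tickets//../../etc/passwd']
print(probe(normalized_policy, "read_file", "/etc/passwd"))  # []
```

`posixpath.normpath` is purely lexical and does not resolve symbolic links, so a control that reads from a real filesystem may also need to resolve the path (for example with `os.path.realpath`) before checking it.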
Phase 5: Documentation and Sign-Off
- Document all findings, including both passing and failing checks
- Record remediation actions for any identified gaps
- Obtain sign-off from the safety reviewer and the agent owner
- Schedule the next review date
Store review artifacts alongside the agent's audit trail. They form part of the compliance record and demonstrate organizational diligence to regulators.
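A review record can be kept as a structured file next to the audit trail. This sketch assumes JSON as the storage format and a six-month review cycle counted in calendar months. `Finding`, `ReviewRecord`, and the role names `safety_reviewer` and `agent_owner` are hypothetical. `problems()` flags failing checks with no recorded remediation and missing sign-offs, mirroring the Phase 5 list.

```python
from __future__ import annotations

import calendar
import json
from dataclasses import asdict, dataclass, field
from datetime import date

REQUIRED_SIGN_OFFS = ("safety_reviewer", "agent_owner")  # hypothetical role names


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping to the month's last day."""
    index = d.month - 1 + months
    year, month = d.year + index // 12, index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


@dataclass
class Finding:
    check: str
    passed: bool
    remediation: str = ""  # expected whenever passed is False


@dataclass
class ReviewRecord:
    agent_id: str
    review_date: date
    findings: list[Finding] = field(default_factory=list)
    sign_offs: dict[str, str] = field(default_factory=dict)  # role -> signer

    def problems(self) -> list[str]:
        """List gaps that should block closing the review."""
        issues = [f"no remediation for failing check: {f.check}"
                  for f in self.findings if not f.passed and not f.remediation]
        issues += [f"missing sign-off: {role}"
                   for role in REQUIRED_SIGN_OFFS if role not in self.sign_offs]
        return issues

    def to_json(self) -> str:
        data = asdict(self)  # nested Finding objects become plain dicts
        data["review_date"] = self.review_date.isoformat()
        data["next_review"] = add_months(self.review_date, 6).isoformat()
        return json.dumps(data, indent=2, sort_keys=True)


record = ReviewRecord(
    agent_id="ticket-triage-agent",
    review_date=date(2024, 8, 31),
    findings=[Finding("policy engine denies delete_ticket", passed=True),
              Finding("kill switch response time", passed=False)],
    sign_offs={"safety_reviewer": "A. Reviewer"},
)
print(record.problems())
# ['no remediation for failing check: kill switch response time', 'missing sign-off: agent_owner']
print(json.loads(record.to_json())["next_review"])  # 2025-02-28
```

Clamping matters for reviews held late in a month: six months after August 31 is February 28 (or 29), not an invalid date.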