
What Is Constitutional AI

Authensor

Constitutional AI (CAI) is an approach to AI safety developed by Anthropic. Instead of relying entirely on human feedback to judge model outputs, CAI provides the model with a set of written principles (a "constitution") and trains it to critique and revise its own responses according to those principles.

The process works in two phases. In the first phase, the model generates responses to prompts, critiques those responses against the constitutional principles, and then revises them; the resulting improved responses form a dataset for supervised fine-tuning. In the second phase, the model is trained with reinforcement learning, but the preference labels behind the reward signal come from an AI evaluator comparing outputs against the constitution rather than from human raters (reinforcement learning from AI feedback, or RLAIF).
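The first phase can be sketched as a simple loop. Everything below is illustrative: `generate`, `critique`, and `revise` stand in for model calls that do not exist as a real API, and the principle sampling mirrors the idea that each critique is made against a principle drawn from the constitution.

```python
import random

# Hypothetical constitution: a short list of natural-language principles.
CONSTITUTION = [
    "Choose the response that is most helpful while being honest and not harmful.",
    "Choose the response that would be most appropriate in a professional context.",
]

def generate(prompt: str) -> str:
    """Stand-in for a model call that produces an initial draft response."""
    return f"draft response to: {prompt}"

def critique(response: str, principle: str) -> str:
    """Stand-in for asking the model to critique a response against one principle."""
    return f"critique of {response!r} under: {principle}"

def revise(response: str, critique_text: str) -> str:
    """Stand-in for asking the model to rewrite the response per the critique."""
    return f"revised({response})"

def critique_revision_step(prompt: str) -> tuple[str, str]:
    """One phase-1 iteration: generate, critique against a sampled principle, revise."""
    principle = random.choice(CONSTITUTION)
    draft = generate(prompt)
    notes = critique(draft, principle)
    improved = revise(draft, notes)
    # The (prompt, improved response) pair goes into the fine-tuning dataset.
    return prompt, improved

dataset = [critique_revision_step(p) for p in ["How do I reset my router?"]]
```

Because the principles live in a plain list, swapping or rewording them changes the data-generation behavior without touching the training code, which is the property the next paragraph describes.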

The constitution itself contains principles like "choose the response that is most helpful while being honest and not harmful" or "choose the response that would be most appropriate in a professional context." These principles can be updated without retraining the entire model.

CAI offers several advantages over pure RLHF:

Scalability. Human evaluation is expensive and slow. Self-critique scales with compute rather than human availability.

Transparency. The principles are written in natural language and can be reviewed, audited, and updated by policy teams.

Consistency. A fixed set of principles produces more consistent safety behavior than aggregating diverse human preferences.

Reduced bias. Because it relies less on individual human judgments, CAI limits the influence of any single evaluator's idiosyncrasies.

However, CAI still operates at training time. It shapes the model's default behavior but cannot guarantee compliance at runtime. An agent operating in a novel context may still take actions that violate the spirit of the constitution.

Runtime safety infrastructure provides the enforcement layer that training-time methods cannot. Policy engines evaluate every action against explicit rules. Content scanners check every output. Together with CAI, they form a defense-in-depth approach to agent safety.
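A runtime policy engine of the kind described above can be reduced to a small sketch. This is a minimal illustration under assumed names (`Action`, `Rule`, `evaluate` are all hypothetical, not a real product API): each rule inspects an action and may return a verdict, and the first explicit verdict wins.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical action record emitted by an agent at runtime.
@dataclass
class Action:
    kind: str    # e.g. "file_write", "network_request"
    target: str  # e.g. a filesystem path or URL

# A rule maps an action to a verdict; None means "no opinion".
Rule = Callable[[Action], Optional[str]]

def deny_writes_outside_workspace(action: Action) -> Optional[str]:
    # Example explicit rule: block writes anywhere but an allowed directory.
    if action.kind == "file_write" and not action.target.startswith("/workspace/"):
        return "deny"
    return None

def evaluate(action: Action, rules: list[Rule]) -> str:
    """Apply each rule in order; the first explicit verdict wins, default allow."""
    for rule in rules:
        verdict = rule(action)
        if verdict is not None:
            return verdict
    return "allow"

rules = [deny_writes_outside_workspace]
print(evaluate(Action("file_write", "/etc/passwd"), rules))       # deny
print(evaluate(Action("file_write", "/workspace/out.txt"), rules))  # allow
```

The point of the sketch is the division of labor: training-time methods like CAI shape what the model tends to do, while rules like this decide what an action is allowed to do, every time, regardless of context.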
