Health checks verify that an agent is running, responsive, and behaving correctly. Unlike monitoring, which observes ongoing behavior, health checks are active probes that test specific capabilities. A passing health check means the agent can accept work. A failing health check means the agent should be removed from the pool until it recovers.
The simplest health check: is the agent process running and responsive? Send a lightweight request and verify a response within a timeout. Liveness checks detect crashed processes, deadlocks, and unresponsive runtimes.
A running agent may not be ready to serve. It might still be loading models, connecting to databases, or initializing safety components. Readiness checks verify that all dependencies are available and the agent is prepared to process requests. Route traffic only to agents that pass readiness checks.
Verify that safety infrastructure is operational:
health_checks:
policy_engine:
test_envelope: { action: "health.check", principal: "system" }
expected_result: "allow"
aegis_scanner:
test_input: "normal safe text"
expected_result: "clean"
audit_trail:
test_write: true
expected_result: "success"
For agents powered by language models, verify that the model produces expected outputs for known inputs. Include test prompts with known-good responses. If the model returns unexpected results, the agent may be using a corrupted model or incorrect configuration.
Run liveness checks every 10 to 30 seconds. Run readiness checks at startup and after any configuration change. Run safety component checks every 1 to 5 minutes. Run model capability checks every 15 to 30 minutes or after model updates.
When a health check fails, remove the agent from the load balancer, alert the operations team, and attempt automated recovery (restart, reinitialize). If recovery fails after a configured number of attempts, escalate to manual intervention.
Health checks that are too aggressive cause false failures, where a healthy agent is removed from the pool because a transient condition caused a single check to fail. Use check thresholds: require two or three consecutive failures before marking an agent unhealthy.
Health checks are your early warning system. They catch problems before users or monitoring systems notice.
Explore more guides on AI agent safety, prompt injection, and building secure systems.
View All Guides