
End-to-End Testing AI Agent Workflows

Authensor

End-to-end (E2E) tests exercise the entire system as a user would experience it. For AI agent systems, this means sending a request, having an agent process it through the full safety pipeline, and verifying the final output. E2E tests catch integration issues that unit and integration tests miss because they test the system as a whole.

Test Architecture

An E2E test for an AI agent workflow involves:

  1. A test client that sends requests
  2. The full agent runtime with real (or realistic mock) model inference
  3. All safety infrastructure: policy engine, Aegis, Sentinel, approval workflows
  4. A test database for audit trail storage
  5. Assertions on the final output and the safety artifacts produced
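The five pieces above can be sketched as a minimal in-process harness. All names here (`createE2EContext`, `TestContext`, the receipt shapes) are illustrative assumptions, not a real API; a real suite would boot the actual agent runtime and safety infrastructure instead of simulating them.

```typescript
// Illustrative sketch of an E2E harness wiring the five components above.
// The runtime and safety pipeline are simulated in-process for brevity.

interface Receipt { type: string; decision: string; }

interface TestContext {
  request(task: string): Promise<{ status: string; output: string; traceId: string }>;
  getReceipts(traceId: string): Receipt[];
  teardown(): void;
}

function createE2EContext(): TestContext {
  // Components 1-3: in a real suite this would start the agent runtime and
  // safety infrastructure; here the pipeline is faked.
  const auditTrail = new Map<string, Receipt[]>(); // component 4: test "database"
  let nextTrace = 0;

  return {
    async request(task) {
      const traceId = `trace-${nextTrace++}`;
      // Simulate the safety pipeline recording receipts as the task runs.
      auditTrail.set(traceId, [
        { type: 'policy_check', decision: 'allow' },
        { type: 'aegis_scan', decision: 'clean' },
      ]);
      return { status: 'completed', output: `Summary for: ${task}`, traceId };
    },
    getReceipts(traceId) {
      return auditTrail.get(traceId) ?? [];
    },
    teardown() {
      auditTrail.clear(); // component 5's assertions run before teardown
    },
  };
}
```

The test in the next section would then call `request` and assert on both the returned output and the receipts the harness recorded.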

Example Test

describe('research agent E2E', () => {
  it('completes a research task with safety checks', async () => {
    // Send a research request
    const result = await testClient.request({
      task: 'Summarize recent findings on prompt injection defense',
      agent: 'research-agent',
    });

    // Verify the output
    expect(result.status).toBe('completed');
    expect(result.output).toContain('prompt injection');

    // Verify safety artifacts
    const receipts = await getReceipts(result.traceId);
    expect(receipts.length).toBeGreaterThan(0);
    expect(receipts.every(r => r.decision !== 'bypass')).toBe(true);

    // Verify Aegis scanned the output
    const scanReceipt = receipts.find(r => r.type === 'aegis_scan');
    expect(scanReceipt).toBeDefined();
    expect(scanReceipt.result).toBe('clean');
  });
});

Handling Non-Determinism

AI agents produce non-deterministic outputs. E2E tests must account for this. Two approaches:

Structural assertions: Verify the structure and safety properties of the output rather than exact content. Check that the response has the right format, that safety checks ran, and that no policy violations occurred.
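A structural-assertion helper might look like the following sketch. The `AgentResult` shape and check names are assumptions for illustration; the point is that every check holds for any acceptable output, regardless of the exact text the model produced.

```typescript
// Hypothetical sketch: assert on structure and safety properties,
// not on exact model output.

interface AgentResult {
  status: string;
  output: string;
  receipts: { type: string; decision: string }[];
}

function assertStructurallySafe(result: AgentResult): string[] {
  const failures: string[] = [];
  if (result.status !== 'completed') failures.push('run did not complete');
  if (result.output.trim().length === 0) failures.push('empty output');
  if (result.receipts.length === 0) failures.push('no safety receipts recorded');
  if (result.receipts.some((r) => r.decision === 'bypass'))
    failures.push('a safety check was bypassed');
  return failures; // empty array means all structural checks passed
}
```

Returning a list of failures rather than throwing on the first one makes test diagnostics clearer when several properties break at once.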

Mock inference: Replace the model with a deterministic mock that returns known outputs. This makes tests reproducible but does not exercise real model behavior.
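A deterministic mock can be as simple as a lookup table of canned completions. This is a sketch under assumed names (`Inference`, `createMockInference`); the real injection point depends on how the agent runtime accepts a model client.

```typescript
// Hypothetical deterministic mock: maps known prompts to canned completions
// so every test run sees identical model output.

type Inference = (prompt: string) => Promise<string>;

function createMockInference(canned: Record<string, string>): Inference {
  return async (prompt: string) => {
    // Return the canned completion if one matches, else a fixed fallback,
    // so the test never depends on real model behavior.
    return canned[prompt] ?? '[no canned response]';
  };
}
```

In practice, teams often run both: mocked-inference E2E tests on every commit for reproducibility, and a smaller real-inference suite on a schedule to catch model-behavior drift.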

Test Data Management

Use isolated test data that does not interfere with other tests or environments. Create test databases before each test run and destroy them after. Seed databases with known data that supports the test scenarios.
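The lifecycle above can be sketched with a uniquely named store per run. Here an in-memory `Map` stands in for a real database; the class and method names are illustrative, not a real library API.

```typescript
// Sketch of per-run test data isolation: create a uniquely named store,
// seed it with known data, and destroy it after the run.

class TestDatabase {
  readonly name: string;
  private rows = new Map<string, unknown>();

  constructor(runId: string) {
    // Unique name per run, so parallel runs cannot interfere.
    this.name = `e2e_${runId}`;
  }

  seed(data: Record<string, unknown>): void {
    for (const [key, value] of Object.entries(data)) this.rows.set(key, value);
  }

  get(key: string): unknown {
    return this.rows.get(key);
  }

  destroy(): void {
    this.rows.clear(); // a real implementation would drop the database here
  }
}
```

Wrapping creation in a `beforeEach` and destruction in an `afterEach` (or the equivalent in your test runner) keeps each scenario hermetic.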

Performance Expectations

E2E tests are slower than unit or integration tests because they exercise the full system. Keep the E2E test suite focused on critical workflows. Aim for 10 to 20 E2E scenarios that cover the most important safety-critical paths.

CI Integration

Run E2E tests on every deployment candidate. They serve as the final gate before promotion to production. A failing E2E test blocks deployment.

E2E tests answer the question that matters most: does the system work correctly from the user's perspective?
