OC-301f · Module 3

CI/CD for Agent Systems


Continuous integration for agent systems runs three test suites in sequence, each with its own gate. The fast suite (unit tests, linting, and type checking) completes in under 2 minutes and must pass with zero failures. The integration suite (lifecycle tests, event handling, and module compatibility) completes in 5 to 10 minutes and also requires zero failures. The behavioral suite (output quality tests with automated scoring) completes in 15 to 30 minutes; its gate requires quality scores above threshold and no regressions against the baseline.
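The sequencing can be sketched as a fail-fast loop: each stage runs only if the previous gate passed, so the expensive behavioral stage never starts on a build that already failed a cheap check. This is a minimal illustration, not a real CI configuration; `Stage` and `run_pipeline` are hypothetical names, and each stage's `run` callable stands in for whatever command actually executes the suite.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # hypothetical: returns True when the stage's gate passes


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first stage whose gate fails.

    Returns the names of the stages that passed.
    """
    passed = []
    for stage in stages:
        if not stage.run():
            break  # later (more expensive) stages are skipped
        passed.append(stage.name)
    return passed


if __name__ == "__main__":
    stages = [
        Stage("fast", lambda: True),
        Stage("integration", lambda: False),  # simulated failure
        Stage("behavioral", lambda: True),
    ]
    print(run_pipeline(stages))
```

In this simulated run the integration gate fails, so the behavioral stage and its LLM calls are never triggered.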

The behavioral suite is the expensive stage. Each behavioral test requires an LLM call, and each test should be run multiple times (three runs per test in this pipeline) to account for non-determinism. The CI pipeline must therefore balance thoroughness against cost and time. The strategy has three parts. Run the full behavioral suite on pull requests that modify prompts, personas, or core logic. Run a smoke subset (5 critical tests) on all other PRs. Run the full suite nightly against the production configuration to catch environmental drift.
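The selection rule above can be sketched as a small function that looks at the files a PR touches. The path patterns and the `trigger` values are assumptions made for illustration; a real repository would substitute its own layout and its CI system's event names.

```python
from fnmatch import fnmatch

# Hypothetical path patterns for prompts, personas, and core logic.
BEHAVIOR_SENSITIVE_PATTERNS = ("prompts/*", "personas/*", "src/core/*")


def select_behavioral_scope(changed_files, trigger="pull_request"):
    """Return "full" or "smoke" for the behavioral suite.

    trigger is a hypothetical label: "pull_request" or "nightly".
    """
    if trigger == "nightly":
        return "full"  # nightly runs always use the full suite
    for path in changed_files:
        if any(fnmatch(path, pattern) for pattern in BEHAVIOR_SENSITIVE_PATTERNS):
            return "full"  # the change can affect model-facing behavior
    return "smoke"  # everything else gets the critical-test subset


if __name__ == "__main__":
    print(select_behavioral_scope(["prompts/support_agent.md"]))
    print(select_behavioral_scope(["docs/setup.md"]))
```

Defaulting to the smoke subset keeps LLM cost bounded on routine PRs, while the nightly full run covers anything the path patterns miss.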

1. Fast Suite (< 2 min): Unit tests, linting, type checking. Runs on every commit. Zero-tolerance gate: any failure blocks the pipeline. These tests are cheap enough to run on every commit.
2. Integration Suite (< 10 min): Lifecycle hooks, event handling, module compatibility. Runs on every PR. Zero-tolerance gate. These tests verify system behavior without LLM calls.
3. Behavioral Suite (< 30 min): Output quality with automated scoring, run 3x per test for non-determinism. Full suite on prompt/persona/core-logic changes, smoke subset on other PRs, full suite nightly. Quality-threshold gate: not zero-tolerance, but no regression against the baseline is allowed.
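One way to express the behavioral gate is to average each test's scores across its repeated runs, then check the average against both an absolute threshold and the stored baseline. The sketch below assumes scores in the range 0 to 1; the `threshold` and `tolerance` defaults are illustrative values, not figures from this module, and `evaluate_behavioral_gate` is a hypothetical helper.

```python
from statistics import mean


def evaluate_behavioral_gate(run_scores, baseline, threshold=0.8, tolerance=0.02):
    """Decide whether the behavioral gate passes.

    run_scores: {test_name: [score for each repeated run]}
    baseline:   {test_name: previously accepted mean score}
    threshold and tolerance are assumed values for illustration.
    Returns (passed, list_of_failure_messages).
    """
    failures = []
    for name, scores in run_scores.items():
        avg = mean(scores)  # averaging smooths run-to-run non-determinism
        if avg < threshold:
            failures.append(f"{name}: mean {avg:.2f} is below threshold {threshold:.2f}")
        elif name in baseline and avg < baseline[name] - tolerance:
            failures.append(
                f"{name}: mean {avg:.2f} regressed from baseline {baseline[name]:.2f}"
            )
    return (not failures, failures)


if __name__ == "__main__":
    scores = {"summarize": [0.90, 0.85, 0.95], "refusal": [0.82, 0.80, 0.84]}
    baseline = {"summarize": 0.90, "refusal": 0.90}
    passed, failures = evaluate_behavioral_gate(scores, baseline)
    print(passed)
    for message in failures:
        print(message)
```

The small tolerance keeps ordinary sampling noise from blocking a PR, while a genuine drop against the baseline still fails the gate even when the score stays above the absolute threshold.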