OC-301a · Module 3

Monitoring & Alerting

3 min read

You cannot manage what you cannot measure. Monitoring is not a feature you add after deployment — it is a prerequisite for deployment. An unmonitored agent system is one you discover is broken when a customer calls to complain. By then the damage is done, the logs may be incomplete, and the root cause is buried under hours of unobserved drift. Monitoring transforms invisible failures into visible events that get handled before they cascade.

The monitoring stack has four layers:

  1. Infrastructure metrics: CPU, memory, disk, network for every compute node.
  2. Application metrics: agent response times, task completion rates, error rates, queue depths.
  3. Business metrics: decisions made per hour, escalation rates, council deliberation times, skill generation rates.
  4. Anomaly detection: machine learning models that flag when any metric deviates beyond historical norms.

Each layer catches a different class of problem. Infrastructure metrics catch hardware failures. Application metrics catch code bugs. Business metrics catch operational drift. Anomaly detection catches the problems you did not think to look for.
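The anomaly-detection layer can be sketched in a few lines. This is a minimal illustration, not the course's production design: it uses a simple z-score against a historical window where a real deployment would use trained models, and the function name and threshold are assumptions.

```python
from statistics import mean, stdev

def is_anomalous(history, value, threshold=3.0):
    """Flag a metric value that deviates beyond historical norms.

    Illustrative sketch: a z-score against the historical window stands in
    for the ML models the monitoring stack would actually use.
    """
    if len(history) < 2:
        return False  # not enough history to judge deviation
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu  # flat history: any change is a deviation
    return abs(value - mu) / sigma > threshold

# A steady error-rate history makes a sudden spike stand out.
history = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.008]
print(is_anomalous(history, 0.011))  # within historical norms
print(is_anomalous(history, 0.250))  # far outside them
```

The point of this layer is exactly the last line: nobody wrote a rule for a 25% error rate, but the deviation from history flags it anyway.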

  1. Alert Level 1: Warning. A metric is trending toward a threshold but has not breached it. Action: log the alert, notify the on-call team via Slack, continue monitoring. Example: CPU usage on a compute node has been above 80% for 15 minutes.
  2. Alert Level 2: Critical. A metric has breached its threshold. Action: page the on-call engineer, auto-scale if configured, pause non-essential agent operations. Example: error rate exceeds 5% for any agent over a 5-minute window.
  3. Alert Level 3: Emergency. A service-affecting incident is in progress. Action: page the incident commander, activate the runbook, halt self-modification processes, preserve all logs. Example: the coordination layer is unreachable from more than one availability zone.
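The escalation policy above can be encoded as data rather than prose, so the same thresholds drive both classification and response. This is a hedged sketch: the 5% critical threshold for error rates comes from the policy, but the warning band below it, the function names, and the action labels are illustrative assumptions.

```python
from enum import IntEnum

class AlertLevel(IntEnum):
    WARNING = 1    # trending toward a threshold
    CRITICAL = 2   # threshold breached
    EMERGENCY = 3  # service-affecting incident in progress

# Hypothetical action lists mirroring the three levels above.
ACTIONS = {
    AlertLevel.WARNING:   ["log_alert", "notify_slack", "continue_monitoring"],
    AlertLevel.CRITICAL:  ["page_on_call", "auto_scale",
                           "pause_nonessential_agents"],
    AlertLevel.EMERGENCY: ["page_incident_commander", "activate_runbook",
                           "halt_self_modification", "preserve_all_logs"],
}

def classify_error_rate(error_rate):
    """Classify an agent error rate over its 5-minute window.

    The 5% critical threshold is from the policy; the 4% warning band
    (value trending toward the threshold) is an assumed example.
    """
    if error_rate > 0.05:
        return AlertLevel.CRITICAL   # threshold breached
    if error_rate > 0.04:
        return AlertLevel.WARNING    # trending toward the threshold
    return None                      # nominal: no alert

level = classify_error_rate(0.07)
print(level.name, ACTIONS[level])
```

Keeping level-to-action mappings in one table makes the runbook auditable: a reviewer can verify that every critical alert pages someone without reading the alerting code path by path.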