OC-201c · Module 2

Health Checks & Heartbeats

A heartbeat answers one question: is the agent alive? A health check answers a harder question: is the agent healthy? An agent can be alive (process running, accepting messages) but unhealthy (database connection dropped, API quota exhausted, cron scheduler stuck). Health checks probe each subsystem independently and report a holistic status. If the heartbeat says the agent is running but the health check says the database is unreachable, you know exactly what to fix.

Build a health check for each critical dependency. Database connectivity: can the agent read from and write to the database? API reachability: does a lightweight ping to each API return a success response? Cron scheduler: has the scheduler fired its most recent job within the expected window? Disk space: is the log directory consuming more than 80% of available storage? Each check returns pass, warn, or fail. A single failure does not necessarily mean the whole agent is down; the CRM module can keep operating while the weather API is unreachable. The aggregated health status then tells you whether the system as a whole is green, yellow, or red.
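To make the pass, warn, and fail outcomes concrete, here is a minimal Python sketch of two of the checks described above. The names (`Status`, `CheckResult`, `check_disk`, `check_cron`), the 95% fail threshold, and the "twice the expected interval" grace window are illustrative assumptions, not part of the course material; only the 80% warning level comes from the text. Note that `shutil.disk_usage` reports usage for the whole filesystem containing the path, which is a simplification of "log directory" usage.

```python
import shutil
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass
class CheckResult:
    name: str
    status: Status
    detail: str


def classify_disk_usage(used_fraction: float,
                        warn_at: float = 0.80,
                        fail_at: float = 0.95) -> Status:
    """Map a usage fraction to a status. fail_at is an assumed threshold."""
    if used_fraction >= fail_at:
        return Status.FAIL
    if used_fraction >= warn_at:
        return Status.WARN
    return Status.PASS


def check_disk(path: str = ".") -> CheckResult:
    """Read-only check of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    used_fraction = usage.used / usage.total
    return CheckResult("disk", classify_disk_usage(used_fraction),
                       f"{used_fraction:.0%} used")


def check_cron(last_run_ts: float,
               expected_interval_s: float,
               now: Optional[float] = None) -> CheckResult:
    """Compare the age of the last scheduled run to the expected window."""
    now = time.time() if now is None else now
    age = now - last_run_ts
    if age <= expected_interval_s:
        status = Status.PASS
    elif age <= 2 * expected_interval_s:  # assumed grace window
        status = Status.WARN
    else:
        status = Status.FAIL
    return CheckResult("cron", status, f"last run {age:.0f}s ago")
```

How the agent records `last_run_ts` is left open here; a timestamp written by the scheduler after each job is one plausible source.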

1. Define Critical Dependencies: List every external system your agent depends on: database, APIs, file system, network, cron scheduler. Each one gets its own health check.
2. Build Lightweight Probes: Each health check should be fast (under 2 seconds) and non-destructive. A database check runs SELECT 1. An API check hits the provider's status endpoint. A disk check reads filesystem usage. Never modify state during a health check.
3. Aggregate and Report: Combine all check results into a single status report: green (all passing), yellow (warnings present), red (critical failures). Send the report to your monitoring channel. When something goes red, the report tells you which specific subsystem needs attention.
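The three steps above can be sketched in Python roughly as follows. This is one possible shape under stated assumptions, not a definitive implementation: `probe_database`, `run_probe`, `aggregate`, and `build_report` are hypothetical names, the in-memory SQLite connection stands in for whatever database the agent really uses, and the rule that a failure in a non-critical check yields yellow rather than red is an interpretation of "critical failures" in the text.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Set

PROBE_TIMEOUT_S = 2.0  # "under 2 seconds" budget from the text


def probe_database() -> str:
    """Non-destructive probe: SELECT 1 against a stand-in SQLite database."""
    conn = sqlite3.connect(":memory:")  # assumption: replace with the real connection
    try:
        conn.execute("SELECT 1").fetchone()
        return "pass"
    finally:
        conn.close()


def run_probe(probe: Callable[[], str],
              timeout_s: float = PROBE_TIMEOUT_S) -> str:
    """Run one probe in a worker thread; a timeout or any error counts as fail."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(probe)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a probe that raised and one that exceeded the timeout.
        return "fail"
    finally:
        # Do not block the report on a probe that is still hanging.
        executor.shutdown(wait=False)


def aggregate(results: Dict[str, str], critical: Set[str]) -> str:
    """Red if a critical check failed, yellow on any other problem, else green."""
    if any(results.get(name) == "fail" for name in critical):
        return "red"
    if any(status != "pass" for status in results.values()):
        return "yellow"
    return "green"


def build_report(probes: Dict[str, Callable[[], str]],
                 critical: Set[str]) -> dict:
    """Run every probe and name the subsystems that need attention."""
    results = {name: run_probe(probe) for name, probe in probes.items()}
    return {
        "overall": aggregate(results, critical),
        "checks": results,
        "needs_attention": [n for n, s in results.items() if s != "pass"],
    }
```

A caller might invoke `build_report({"database": probe_database}, critical={"database"})` on a schedule and forward the resulting dictionary to the monitoring channel; the `needs_attention` list is what points to the specific subsystem to fix.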