OC-201c · Module 3

The Diagnostic Playbook

4 min read

When your agent stops working, the worst thing you can do is start guessing. Change a setting, restart the process, try a different API key, reboot the machine — each blind fix takes time and may mask the actual problem. The diagnostic playbook is a systematic sequence of checks that isolates the failure layer by layer. Is the process running? Is the network connected? Is the API responding? Is the database accessible? Is the module itself throwing an error? Each check eliminates one layer and narrows the search.

The mean-time-to-explain metric applies here: how long it takes, from the moment you start diagnosing, to state what failed and why. If you cannot do that within 5 minutes, the observability infrastructure is insufficient. You need better logs, better health checks, or better error messages. Every incident that takes longer than 5 minutes to diagnose is a signal to improve the monitoring, not just to fix the immediate problem. Fix the bug today. Fix the observability gap tomorrow. The second fix keeps future incidents from being equally painful.
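One way to make the 5-minute rule concrete is to record when diagnosis started and when the cause was explained, then flag every incident that blew the budget. The sketch below is illustrative only: the `Incident` record and the `needs_observability_work` helper are hypothetical names, not part of OpenClaw.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# The 5-minute budget comes from the text; everything else here is assumed.
EXPLAIN_BUDGET = timedelta(minutes=5)


@dataclass
class Incident:
    name: str
    diagnosis_started: datetime
    explained_at: Optional[datetime]  # None if the cause was never explained


def needs_observability_work(incident: Incident) -> bool:
    """True when explaining the failure took longer than the budget, or never happened."""
    if incident.explained_at is None:
        return True
    return incident.explained_at - incident.diagnosis_started > EXPLAIN_BUDGET
```

An incident that returns True goes on tomorrow's list: add the log line, health check, or error message that would have explained it in time.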

1. Process Check: Is the OpenClaw process running? Check PM2 status or the system process list. If the process is down, read the logs for the crash reason before restarting. A restart without diagnosis hides the root cause.
2. Network Check: Is the machine connected to the internet? Can it resolve DNS? Can it reach the Telegram API? Network failures are common on always-on machines: routers restart, DHCP leases expire, and Wi-Fi drops in clamshell mode.
3. API Check: Are the external APIs responding? Test each API with a minimal request. Rate limiting, expired tokens, and provider outages all present differently. The error message usually tells you which provider failed and what kind of failure it was.
4. Database Check: Is the local database accessible? Can you read from it? Can you write to it? A full disk, a corrupted index, or a locked file all block database operations, but they surface as module errors.
5. Module Check: Run the failing module manually with known-good input. If it succeeds, the issue is in the trigger or scheduling layer. If it fails, the error message from the isolated run tells you what is broken in the module logic.
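The checklist above can be sketched as a small runner that executes the checks in order and stops at the first failing layer. This is a minimal illustration, not OpenClaw's own tooling: `run_playbook`, `check_tcp`, and `check_sqlite` are hypothetical helpers, and the database check assumes the local database is SQLite, which the lesson does not state.

```python
import socket
import sqlite3
from typing import Callable, List, Optional, Tuple

CheckResult = Tuple[bool, str]                 # (passed, human-readable detail)
Check = Tuple[str, Callable[[], CheckResult]]  # (layer name, check function)


def check_tcp(host: str, port: int = 443, timeout: float = 5.0) -> CheckResult:
    """Network layer: resolve the host name and open a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"connected to {host}:{port}"
    except OSError as exc:  # covers DNS failures, timeouts, refused connections
        return False, f"cannot reach {host}:{port}: {exc}"


def check_sqlite(path: str) -> CheckResult:
    """Database layer (assumes SQLite): prove the file can be read and written."""
    try:
        conn = sqlite3.connect(path, timeout=2.0)
        try:
            conn.execute("CREATE TABLE IF NOT EXISTS _diag_probe (ts TEXT)")
            conn.execute("INSERT INTO _diag_probe VALUES (datetime('now'))")
            conn.execute("SELECT COUNT(*) FROM _diag_probe").fetchone()
            conn.rollback()  # discard the probe row; the empty probe table stays
        finally:
            conn.close()
        return True, "read and write succeeded"
    except sqlite3.Error as exc:
        return False, f"database error: {exc}"


def run_playbook(checks: List[Check]) -> Optional[Tuple[str, str]]:
    """Run checks in order; return (layer, detail) for the first failure, else None."""
    for layer, check in checks:
        ok, detail = check()
        print(f"[{'ok' if ok else 'FAIL'}] {layer}: {detail}")
        if not ok:
            return layer, detail
    return None
```

A run might pass a list such as `[("network", lambda: check_tcp("api.telegram.org")), ("database", lambda: check_sqlite("agent.db"))]`, where `agent.db` is a placeholder path. Process, API, and module checks slot into the same list as further functions returning `(passed, detail)`. Stopping at the first failure is deliberate: the layers below the failing one are already cleared, and the layers above it cannot be trusted until it is fixed.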