OC-201c · Module 1

Production Readiness Checklist

There is a gap between "it works on my machine" and "it runs reliably in production." That gap is not about code quality; it is about operational infrastructure. Your agent works in development because you are watching it, restarting it when it crashes, and fixing issues in real time. In production, nobody is watching. The system must restart itself, recover from failures, and alert you when something requires human attention. Building that operational infrastructure is what this module teaches.

The production readiness checklist has five categories. Process management: does OpenClaw restart automatically after a crash or reboot? Environment configuration: are API keys, database credentials, and configuration separated from code? Logging: are errors, warnings, and operational events written to persistent storage where you can find them later? Backup: can you restore the system from scratch if the machine dies? Alerting: does the system tell you when something is wrong, or do you only find out when a scheduled task fails to fire?

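1. Process Management. Use a process manager (PM2, systemd, or launchd) so that OpenClaw restarts automatically after crashes and machine reboots. Surviving a reboot needs one extra step with some managers: with PM2, run "pm2 startup" and then "pm2 save"; with systemd, enable the unit. Verify both cases: kill the process and confirm it comes back, then reboot the machine and confirm the agent comes back online without manual intervention.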
2. Environment Configuration. Move all API keys, credentials, and environment-specific values into .env files or OS-level environment variables, and add .env to .gitignore so it is never committed. The codebase should contain zero secrets. If you could push the entire repository, including its history, to a public GitHub repository without exposing credentials, your environment configuration is correct.
3. Persistent Logging. Write log output to files on disk and rotate them (for example, a new file each day, with files older than a set period deleted) so logs persist without filling the disk. Console output disappears when the terminal closes; file-based logs survive process restarts, crashes, and reboots. You need logs from last Tuesday, not just today.
4. Automated Backup. Schedule code pushes to GitHub (hourly) and database backups to cloud storage (daily). Maintain a restore document that describes how to rebuild the system from scratch using only git, the backup files, and your separately stored secrets, which by design are not in the repository.
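5. Basic Alerting. Configure a heartbeat check that alerts you if the agent goes offline for more than 5 minutes. The simplest valid heartbeat is a cron job that checks the agent is running and sends a Telegram message every 15 minutes: if the messages stop, the agent (or the whole machine) is down. Note that a 15-minute interval only reveals an outage after a missed message, which is slower than the 5-minute target; shorten the interval or use a check that alerts on silence if that gap matters.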
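To make item 1 concrete, here is a toy Python supervisor showing the core behavior a process manager provides: rerun the command when it exits with an error, with exponential backoff and a restart cap so a crash loop does not spin forever. It is an illustration only. It does not survive a reboot, which is exactly why the checklist points you to PM2, systemd, or launchd. The function name and its parameters are hypothetical.

```python
import subprocess
import time


def supervise(cmd, max_restarts=5, backoff_seconds=1.0):
    """Run cmd and rerun it each time it exits with a non-zero status.

    Toy illustration of a process manager's restart policy (hypothetical
    helper). Returns the list of exit codes seen, one per run.
    """
    exit_codes = []
    restarts = 0
    while True:
        result = subprocess.run(cmd)
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            # Clean exit: nothing to recover from, so stop supervising.
            return exit_codes
        if restarts >= max_restarts:
            # Give up instead of crash-looping forever; a real setup
            # would alert a human at this point.
            return exit_codes
        restarts += 1
        # Exponential backoff: wait 1x, 2x, 4x ... backoff_seconds.
        time.sleep(backoff_seconds * 2 ** (restarts - 1))
```

For example, supervise(["node", "agent.js"]) would keep a hypothetical agent.js running through crashes. A real process manager adds boot-time startup, log capture, and status commands on top of this loop.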
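For item 2, a minimal Python sketch of loading secrets from the environment, assuming a simple KEY=VALUE .env format. The variable names MODEL_API_KEY, DATABASE_URL, and TELEGRAM_BOT_TOKEN are hypothetical. Two design choices matter here: variables already set at the OS level take precedence over the file, and missing values fail loudly at startup rather than at the first API call. Maintained libraries such as python-dotenv handle more edge cases of the format.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names; use whatever your agent actually reads.
REQUIRED = ("MODEL_API_KEY", "DATABASE_URL", "TELEGRAM_BOT_TOKEN")


class ConfigError(RuntimeError):
    """Raised at startup when required configuration is missing."""


def parse_dotenv(lines):
    """Parse KEY=VALUE lines; skip blanks and # comments; strip matching quotes."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key] = value
    return values


def load_dotenv(path=".env"):
    """Copy values from a .env file into os.environ without overriding
    variables that are already set at the OS level."""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as f:
        for key, value in parse_dotenv(f).items():
            os.environ.setdefault(key, value)


@dataclass(frozen=True)
class Settings:
    model_api_key: str
    database_url: str
    telegram_bot_token: str


def load_settings(env=None):
    """Build Settings from the environment, failing fast on missing values."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ConfigError("missing environment variables: " + ", ".join(missing))
    return Settings(
        model_api_key=env["MODEL_API_KEY"],
        database_url=env["DATABASE_URL"],
        telegram_bot_token=env["TELEGRAM_BOT_TOKEN"],
    )
```

At startup you would call load_dotenv() once, then load_settings(), and pass the resulting Settings object around instead of reading os.environ throughout the code.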
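For item 3, Python's standard library already supports rotating file logs through logging.handlers.TimedRotatingFileHandler. This sketch assumes a logs directory and a logger named "agent" (both hypothetical). It rotates at midnight and keeps two weeks of old files, so last Tuesday's log is still on disk while total disk usage stays bounded.

```python
import logging
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def configure_logging(log_dir="logs", level=logging.INFO, keep_days=14):
    """Log to <log_dir>/agent.log, rolling over to a new file at midnight.

    Rotated files get a date suffix (agent.log.YYYY-MM-DD); only the newest
    keep_days of them are kept, so disk usage stays bounded.
    """
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("agent")  # hypothetical logger name
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers if called twice
        handler = TimedRotatingFileHandler(
            str(Path(log_dir) / "agent.log"),
            when="midnight",
            backupCount=keep_days,
            encoding="utf-8",
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Because each line carries a timestamp and level, searching a week of files for WARNING or ERROR entries is a simple text search.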
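For item 4, a sketch of the database half of the backup, assuming the agent keeps its state in a SQLite file (an assumption; adapt it to whatever database you actually run). It uses SQLite's online backup API via sqlite3.Connection.backup, which yields a consistent snapshot even while the agent is writing, and prunes all but the newest few snapshots. The upload to cloud storage is deliberately left out: a daily scheduled job would run this and then ship the new file.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def backup_sqlite(db_path, backup_dir, keep=7):
    """Write a consistent, timestamped snapshot of a SQLite database.

    Uses SQLite's online backup API, so the snapshot is consistent even if
    another connection is writing (a plain file copy of a live database can
    be inconsistent). Keeps the newest `keep` snapshots and deletes the rest.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # UTC timestamp with microseconds; sorts chronologically as a string.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_dir / f"agent-{stamp}.db"
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(target)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    # Oldest snapshots sort first; delete everything but the newest `keep`.
    snapshots = sorted(backup_dir.glob("agent-*.db"))
    for old in snapshots[:-keep]:
        old.unlink()
    return target
```

Restoring is the reverse: copy the newest snapshot back into place while the agent is stopped, which is exactly the kind of step the restore document should spell out.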
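For item 5, a sketch of the cron-driven heartbeat using the Telegram Bot API's sendMessage method, plus the staleness check a separate watcher could use to enforce the 5-minute rule. The bot token, chat ID, and the agent_is_up health-check callable are placeholders you would supply (hypothetical), and the function names are illustrative. Building the request is separated from sending it, so nothing goes over the network until send_heartbeat is actually called.

```python
import json
import time
import urllib.request

SEND_MESSAGE_URL = "https://api.telegram.org/bot{token}/sendMessage"


def build_heartbeat_request(token, chat_id, text):
    """Build (but do not send) a Telegram Bot API sendMessage request."""
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SEND_MESSAGE_URL.format(token=token),
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_heartbeat(token, chat_id, agent_is_up):
    """One heartbeat, meant to be run by cron.

    agent_is_up is a health-check callable you supply (hypothetical). When it
    reports the agent is down, nothing is sent: the missing message is the alert.
    """
    if not agent_is_up():
        return False
    request = build_heartbeat_request(
        token, chat_id, "agent alive at " + time.strftime("%Y-%m-%d %H:%M")
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200


def is_stale(last_heartbeat, now, max_silence_seconds=5 * 60):
    """Dead-man's-switch check for a watcher: silent longer than allowed?"""
    return now - last_heartbeat > max_silence_seconds
```

A crontab entry such as */15 * * * * python3 heartbeat.py (script name hypothetical) matches the 15-minute interval in the checklist. If you run is_stale as an active watcher, put it on a different machine from the agent, since a watcher on the same host goes down with it.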