CC-301b · Module 2

Error Recovery


Skills fail. External APIs return errors. Files are missing. Schemas have changed. Commands time out. The difference between a fragile skill and a production skill is how it handles failure. A fragile skill crashes or produces garbage output when any step fails. A production skill detects the failure, reports what went wrong, and either recovers automatically or provides clear instructions for manual recovery.

Error recovery starts in the core instructions. Every step that can fail must have an explicit failure path. For example: "Step 3: Run the migration script. If the script exits with a non-zero code, read the error output and attempt to fix the migration. If the fix attempt also fails, stop execution, report the original error and the fix attempt, and ask the user for guidance." This is verbose. It is also the difference between a skill that works in demos and a skill that works in production.
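The control flow in that Step 3 instruction can also be sketched in Python. This is only an illustration, not how a skill must be written: `run_with_failure_path`, `StepResult`, and the `attempt_fix` callback are hypothetical names, and the "fix" is modeled as a function that reads the error output and returns a corrected command, or `None` if it has nothing to try.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class StepResult:
    ok: bool
    report: str


def run_with_failure_path(
    cmd: Sequence[str],
    attempt_fix: Callable[[str], Optional[Sequence[str]]],
) -> StepResult:
    """Run a step, try one fix on failure, then stop and report."""
    first = subprocess.run(cmd, capture_output=True, text=True)
    if first.returncode == 0:
        return StepResult(True, "step succeeded")

    # Failure path: read the error output and attempt a single fix.
    fixed_cmd = attempt_fix(first.stderr)
    if fixed_cmd is None:
        fix_note = "no fix was attempted"
    else:
        second = subprocess.run(fixed_cmd, capture_output=True, text=True)
        if second.returncode == 0:
            return StepResult(True, "step succeeded after one fix attempt")
        fix_note = f"fix attempt failed: {second.stderr.strip()}"

    # Both paths exhausted: report the original error and the fix attempt.
    report = (
        f"original error: {first.stderr.strip()}\n"
        f"{fix_note}\n"
        "Stopping here; guidance from the user is needed."
    )
    return StepResult(False, report)
```

A caller might pass the migration command as `cmd` and a small function that inspects the error text as `attempt_fix`. The point of the sketch is that the report always carries both the original error and the outcome of the fix attempt.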

There are three tiers of error recovery.

Tier 1: automatic retry. The skill retries the failed step, often with modified parameters. Useful for transient failures like network timeouts or rate limits.

Tier 2: automatic fallback. The skill switches to an alternative approach. If the primary API is down, use a cached version. If the preferred format is not supported, fall back to a simpler format.

Tier 3: graceful escalation. The skill stops, preserves all work completed so far, and provides the user with a clear description of what failed, what was completed, and what remains.
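As a rough sketch of how the three tiers can nest, the Python below retries a primary action on transient failures, falls back to an alternative, and escalates only when both have failed. The names `TransientError`, `Escalation`, and `run_tiered` are hypothetical, and the exponential backoff schedule is an assumption chosen for illustration, not something the tiers require.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


class Escalation(Exception):
    """Raised when retry and fallback have both failed (Tier 3)."""


def run_tiered(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    last_error: Optional[Exception] = None

    # Tier 1: automatic retry, waiting longer before each new attempt.
    # Only TransientError is retried; any other exception propagates.
    for attempt in range(attempts):
        try:
            return primary()
        except TransientError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)

    # Tier 2: automatic fallback, for example a cached copy.
    try:
        return fallback()
    except Exception as exc:
        # Tier 3: escalate with both errors so the user sees the full picture.
        raise Escalation(
            f"primary failed: {last_error}; fallback failed: {exc}"
        ) from exc
```

The `sleep` parameter is injectable so the retry delays can be skipped in tests. In real use, the handler that catches `Escalation` would also be responsible for preserving completed work and telling the user what remains.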

Tier 3 is the most important and the most often neglected. When a skill fails at step 7 of a 10-step pipeline, the user needs to know four things. Steps 1-6 completed successfully and produced these files. Step 7 failed because of this error. Steps 8-10 were not attempted. To resume, fix the issue described above and re-run the skill; it will detect the existing outputs from steps 1-6 and resume from step 7.
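One way to make that resume behavior concrete is to record each completed step in a small manifest file and skip recorded steps on the next run. The sketch below assumes this manifest approach; `run_pipeline`, the `completed.json` filename, and the report wording are all hypothetical, and a real skill might detect the output files themselves instead.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

# A step is a (name, action) pair; the action receives the working directory.
Step = Tuple[str, Callable[[Path], None]]


def run_pipeline(steps: Sequence[Step], workdir: Path) -> str:
    """Run steps in order, skipping any already recorded as completed."""
    workdir.mkdir(parents=True, exist_ok=True)
    manifest = workdir / "completed.json"  # hypothetical manifest file
    done: List[str] = json.loads(manifest.read_text()) if manifest.exists() else []

    for index, (name, action) in enumerate(steps):
        if name in done:
            continue  # finished in an earlier run; do not repeat the work
        try:
            action(workdir)
        except Exception as exc:
            remaining = [n for n, _ in steps[index + 1:]]
            # Graceful escalation: what completed, what failed, what remains.
            return (
                f"Completed: {', '.join(done) or 'none'}\n"
                f"Failed: {name} ({exc})\n"
                f"Not attempted: {', '.join(remaining) or 'none'}\n"
                "To resume: fix the error above and re-run; "
                "completed steps are skipped."
            )
        done.append(name)
        # Persist progress after every step so a later failure loses nothing.
        manifest.write_text(json.dumps(done))

    return f"All {len(steps)} steps completed."
```

Writing the manifest after every step, rather than once at the end, is what lets a re-run pick up at the failed step instead of starting over.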