CC-301m · Module 3
Testing Agentic Instructions
4 min read
You can read a CLAUDE.md and believe it is correct. The only way to know it works is to test it. Agentic instruction testing is not complex — but most people skip it because there is no obvious test runner and no obvious failure signal. The signal is agent behavior. The test is: does the agent do what the spec says, across a representative set of tasks?
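Before walking through the steps, it may help to see what "a representative set of tasks" can look like when written down as data. Below is a minimal sketch in Python; the structure and every field name are invented for illustration and are not part of any Claude tooling.

```python
# Hypothetical sketch: behavioral test cases for a CLAUDE.md, recorded as data.
# The class and field names ("task", "expected") are invented for illustration.
from dataclasses import dataclass

@dataclass
class BehaviorCase:
    name: str      # short label for the scenario
    task: str      # the prompt you give the agent
    expected: str  # the behavior the spec predicts, written BEFORE running

cases = [
    BehaviorCase(
        name="commit-confirmation",
        task="Fix the typo in README.md and commit it.",
        expected="agent asks for confirmation before committing",
    ),
    BehaviorCase(
        name="out-of-scope-stop",
        task="Rename the helper, updating callers in vendored code.",
        expected="agent stops and asks before touching out-of-scope files",
    ),
]

for case in cases:
    print(f"{case.name}: {case.expected}")
```

The point of the structure is the `expected` field: it forces you to commit to the predicted behavior before you run the task, which is the baseline the first step below asks for.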
- **Define the behavioral baseline.** Before testing, write down what you expect the agent to do in five representative scenarios, including edge cases and stop conditions. This is your test specification. "I expect the agent to ask before committing" is a test case. Write it down before you test it, not after.
- **Test stop conditions explicitly.** Most CLAUDE.md failures are stop condition failures: the agent should have stopped and didn't. Test each stop condition by creating the exact situation that should trigger it. If "make a change that requires touching an out-of-scope file" does not produce a stop-and-ask, the stop condition is not working.
- **Test authorization boundaries.** Give the agent tasks that approach your authorization boundaries. Does it confirm before committing? Does it refuse to force-push? Does it ask before deleting? Authorization boundaries exist to prevent specific failures; test them against the specific failures they are designed to prevent.
- **Test skills end-to-end.** Invoke each skill and evaluate the output against the specification in the skill file. Does it read all the required files? Does the output match the format specification? Do the quality gates pass? A skill that produces output is not a skill that produces correct output.
- **Run the regression when you change the instructions.** Every time you update CLAUDE.md, a skill, or a hook, re-run the affected test cases. Instruction changes have consequences that are not always obvious: a clarification in one section can interact with a constraint in another. The regression test is the defense against instruction drift.
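The regression step above needs a way to find "the affected test cases." As a minimal sketch, assuming you tag each behavioral case with the instruction sections it exercises (all tags, case names, and section names here are invented), selection can be a simple set intersection:

```python
# Hypothetical sketch: selecting which behavioral cases to re-run after an
# instruction change. Case names and section tags are invented for illustration.
cases = {
    "commit-confirmation": {"sections": {"authorization"}},
    "out-of-scope-stop":   {"sections": {"scope", "stop-conditions"}},
    "skill-summarize":     {"sections": {"skills"}},
}

def affected(changed_sections, cases):
    """Return the names of cases whose tagged sections overlap the change."""
    return sorted(
        name for name, meta in cases.items()
        if meta["sections"] & set(changed_sections)
    )

# Editing the stop-condition wording should re-run the scope case,
# not the unrelated skill case.
print(affected(["stop-conditions"], cases))  # → ['out-of-scope-stop']
```

The tagging is the work: it makes the non-obvious interactions between sections visible, because a case tagged with two sections gets re-run when either one changes.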
Do This
- Write behavioral test cases before testing, not after
- Test stop conditions by creating the situations that should trigger them
- Test authorization boundaries against the specific failures they prevent
- Re-run affected tests every time you modify instructions
- Document failures: what happened, what was expected, what you changed
Avoid This
- Assume instructions work because they look complete on paper
- Test only the happy path — the failure cases are where CLAUDE.md earns its value
- Skip regression testing when you make "minor" changes — instruction interactions are not always obvious
- Treat a skill as tested because it ran without error — correct execution is not the same as correct output
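One way to act on the "document failures" advice above is to keep each failure as a structured record with the same three parts every time: what happened, what was expected, and what you changed. A minimal sketch, with invented field names:

```python
# Hypothetical sketch: a behavioral test failure as a structured record, so the
# fix can be re-checked on the next regression run. Field names are invented.
from dataclasses import dataclass, asdict

@dataclass
class FailureRecord:
    case: str      # which behavioral case failed
    observed: str  # what the agent actually did
    expected: str  # what the spec said it should do
    change: str    # what you changed in the instructions in response

record = FailureRecord(
    case="out-of-scope-stop",
    observed="agent edited the vendored file without asking",
    expected="agent stops and asks before touching out-of-scope files",
    change="moved the scope rule above the task instructions in CLAUDE.md",
)

print(asdict(record))
```

Keeping the `change` field alongside the failure is what turns the log into a regression suite: the next run checks whether that specific change actually fixed that specific failure.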