CC-301a · Module 3

A/B Testing Rules

3 min read

Not all phrasings of a rule are equally effective. "Do not use any type" and "Replace every instance of any with a specific type — if the type is genuinely unknown, use unknown" produce different behavior from Claude, even though both rules aim for the same outcome. The first is a prohibition. The second is a prohibition with an escape hatch and a preferred alternative. In testing, the second formulation produces fewer rule violations because it gives Claude a path forward when it encounters ambiguity.

A/B testing rules is the practice of deliberately comparing two phrasings of the same rule to determine which produces better code output. It is not glamorous work. It requires keeping notes, running the same prompt with different rule versions, and comparing the results. But for high-frequency rules — the rules that fire on every prompt — the difference between a good phrasing and a great phrasing compounds across hundreds of interactions.

The testing methodology is straightforward. Identify a rule you suspect is underperforming — Claude keeps violating it, or the output quality varies despite the rule being present. Write an alternative phrasing. Run five identical prompts with version A and five with version B, using /clear between each run to eliminate context carryover. Score the outputs on a simple scale: did the rule fire correctly? Was the output quality affected positively, negatively, or not at all?
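The bookkeeping for this methodology can live in a small script. Below is a minimal sketch, assuming you record each run by hand as a (variant, passed) pair after inspecting the output — the run data shown is hypothetical, purely to illustrate the tally.

```python
from collections import defaultdict

# Hypothetical run log: one entry per test run, filled in by hand
# after each /clear'd run. True = the rule fired correctly.
runs = [
    ("A", True), ("A", False), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", True), ("B", True), ("B", True), ("B", False),
]

def compliance_rates(runs):
    """Tally the compliance rate for each rule variant."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for variant, passed in runs:
        totals[variant] += 1
        passes[variant] += passed
    return {v: passes[v] / totals[v] for v in sorted(totals)}

print(compliance_rates(runs))  # {'A': 0.6, 'B': 0.8}
```

Scoring by hand and tallying by script keeps the judgment where it belongs (reading the output) while making the comparison between variants mechanical.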

Five runs per variant is the minimum for meaningful signal. The results are rarely dramatic — you are not discovering that one phrasing works and the other fails completely. You are discovering that one phrasing produces compliant output 80% of the time and the other produces compliant output 95% of the time. That 15-percentage-point improvement, compounded across every prompt in every session, is significant.