PM-301h · Module 1
What Makes a Good Eval
4 min read
"I tried it on a few examples and it worked." That is not an evaluation. It is a positive anecdote. Anecdotes do not catch regressions. They do not tell you what percentage of real inputs the prompt handles correctly. They do not give you a number you can compare against next week when the prompt changes. They are not evaluation — they are hope dressed up as quality assurance.
A production evaluation has three properties. Repeatability: running the same evaluation twice produces the same result. If the eval outcome changes based on who runs it or when, the eval is measuring noise, not quality. Coverage: the evaluation inputs represent the real distribution of production inputs — not cherry-picked easy cases. An eval that only tests the cases the prompt was built for is a performance, not a measurement. Measurability: the outcome of each test case is a number or a binary pass/fail, not a judgment call made fresh each time. "Looks pretty good" is not measurable. "Format compliance: 94.7%" is measurable.
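Measurability in practice means the verdict comes from a deterministic rule, not a fresh judgment. A minimal sketch: the format criterion here (JSON with `summary` and `sentiment` keys) is a hypothetical example, not a prescribed schema.

```python
import json

def score_format(output: str) -> bool:
    """Deterministic pass/fail: does the output parse as JSON with the
    required keys? Same input, same verdict, every run."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and {"summary", "sentiment"} <= data.keys()
```

Because the rule is a pure function of the output, two people running the eval a week apart get the same number.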
The minimal viable eval has three components: a fixed test dataset (inputs that do not change between runs), explicit success criteria per test case (what does a passing output look like?), and an automated scoring mechanism (a function or script that scores each output, not a human reading it case by case for every run). These three components are not optional at 301 level. If any one is missing, you do not have an eval — you have a spot check.
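All three components fit in a few dozen lines. A sketch of the shape, with illustrative names throughout — `run_prompt` stands in for your real model call, and the per-case `must_contain` criterion is one possible success rule, not the only one:

```python
# Fixed dataset: versioned, never edited mid-experiment.
TEST_SET = [
    {"input": "Refund my order #4411", "must_contain": "refund"},
    {"input": "Where is my package?",  "must_contain": "tracking"},
]

def run_prompt(text: str) -> str:
    # Stand-in for the real model call.
    return f"We can help with a refund or tracking for: {text}"

def score(output: str, case: dict) -> bool:
    # Explicit success criterion, defined per case before the run.
    return case["must_contain"] in output.lower()

def run_eval() -> float:
    # Automated scoring over the whole set; one number out.
    passed = sum(score(run_prompt(c["input"]), c) for c in TEST_SET)
    return passed / len(TEST_SET)

print(f"pass rate: {run_eval():.1%}")
```

The point is the structure, not the specific checks: the dataset is fixed, the criteria are written down before any output is seen, and the scorer runs without a human in the loop.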
Do This
- Fix the test dataset — same inputs every run, no additions without version control
- Define success criteria before running the eval, not after seeing the results
- Automate scoring — human-scored evals do not scale and introduce variance
- Report a single number: pass rate, score, or compliance percentage
Avoid This
- Add new test cases to the eval after seeing results — that is p-hacking
- Score outputs by feel during the eval run — define scoring rules first
- Run the eval once and ship — one run is not validation
- Confuse "I cannot find a failure" with "the prompt does not fail"
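The last two points above have a cheap remedy: repeat the run and report the spread, not just the score. A sketch, assuming `run_eval` is the single-number eval from your harness (the placeholder body here is for illustration only):

```python
from statistics import mean

def run_eval() -> float:
    # Placeholder: stands in for a full scored pass over the fixed test set.
    return 0.947

# Repeat the eval; a nonzero spread means the eval (or the model at this
# temperature) is not repeatable, and a single run would have hidden that.
runs = [run_eval() for _ in range(5)]
print(f"mean pass rate: {mean(runs):.1%}, spread: {max(runs) - min(runs):.1%}")
```

If the spread is large, fix the eval's determinism before trusting the mean.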