PM-301h · Module 1

What Makes a Good Eval

4 min read

"I tried it on a few examples and it worked." That is not an evaluation. It is a positive anecdote. Anecdotes do not catch regressions. They do not tell you what percentage of real inputs the prompt handles correctly. They do not give you a number you can compare against next week when the prompt changes. They are not evaluation — they are hope dressed up as quality assurance.

A production evaluation has three properties. Repeatability: running the same evaluation twice produces the same result. If the eval outcome changes based on who runs it or when, the eval is measuring noise, not quality. Coverage: the evaluation inputs represent the real distribution of production inputs — not cherry-picked easy cases. An eval that only tests the cases the prompt was built for is a performance, not a measurement. Measurability: the outcome of each test case is a number or a binary pass/fail, not a judgment call made fresh each time. "Looks pretty good" is not measurable. "Format compliance: 94.7%" is measurable.
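To make "measurable" concrete: a sketch of what a measurable criterion looks like in code. This assumes a hypothetical rule that format compliance means "the output parses as a JSON object" — your own criterion will differ, but the shape (a function returning pass/fail, aggregated into a percentage) is the point.

```python
import json

def is_format_compliant(output: str) -> bool:
    """Binary pass/fail: does the output parse as a JSON object?
    (Hypothetical criterion, standing in for whatever format you require.)"""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False

# Aggregate binary results into one comparable number.
outputs = ['{"summary": "ok"}', 'Sure! Here is the JSON: {...}', '{"summary": "fine"}']
passed = sum(is_format_compliant(o) for o in outputs)
print(f"Format compliance: {passed / len(outputs):.1%}")
```

"Looks pretty good" cannot be diffed against last week's run; the percentage this prints can.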

The minimal viable eval has three components: a fixed test dataset (inputs that do not change between runs), explicit success criteria per test case (what does a passing output look like?), and an automated scoring mechanism (a function or script that scores each output, not a human reading it case by case for every run). These three components are not optional at the 301 level. If any one is missing, you do not have an eval — you have a spot check.
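The three components fit in a few lines. This is a minimal sketch, not a framework: `run_prompt` is a hypothetical stand-in for whatever calls the model under test, and the keyword criterion is a toy example of an explicit per-case rule.

```python
# 1. Fixed dataset: identical inputs on every run.
TEST_CASES = [
    {"input": "Summarize: cats sleep a lot.", "must_contain": "cats"},
    {"input": "Summarize: rain fell all day.", "must_contain": "rain"},
]

def passes(output: str, case: dict) -> bool:
    # 2. Explicit success criterion per case, defined before the run.
    return case["must_contain"] in output.lower()

def run_eval(run_prompt) -> float:
    # 3. Automated scoring: no human in the loop, returns one number.
    results = [passes(run_prompt(c["input"]), c) for c in TEST_CASES]
    return sum(results) / len(results)

# Usage with a trivial stub model that echoes its input:
print(run_eval(lambda text: text))
```

Swap the stub for a real model call and the harness is unchanged — which is exactly what makes the number comparable across prompt versions.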

Do This

  • Fix the test dataset — same inputs every run, no additions without version control
  • Define success criteria before running the eval, not after seeing the results
  • Automate scoring — human-scored evals do not scale and introduce variance
  • Report a single number: pass rate, score, or compliance percentage

Avoid This

  • Add new test cases to the eval after seeing results — that is p-hacking
  • Score outputs by feel during the eval run — define scoring rules first
  • Run the eval once and ship — one run is not validation
  • Confuse "I cannot find a failure" with "the prompt does not fail"
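"One run is not validation" can be enforced mechanically: run the eval several times and report spread alongside the mean, so a lucky single run cannot masquerade as a stable score. A sketch, where `run_once` is a hypothetical callable wrapping one full eval run.

```python
import statistics

def repeated_eval(run_once, n: int = 5) -> tuple[float, float]:
    """Run the eval n times; report mean and spread, not one lucky run."""
    scores = [run_once() for _ in range(n)]
    return statistics.mean(scores), statistics.pstdev(scores)

# Usage with a stub run that always scores 0.9:
mean, spread = repeated_eval(lambda: 0.9)
print(f"pass rate {mean:.1%} (spread {spread:.3f})")
```

A nonzero spread is itself a finding: it tells you how much of any week-over-week change is noise rather than the prompt.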