PM-301h · Module 3

Evaluation Failure Patterns

A miscalibrated evaluation is worse than no evaluation. It provides a false signal: high eval scores for prompts that fail in production, low eval scores for prompts that actually work well, or consistent scores that mask systematic failure on the cases that matter most. Understanding the ways evaluations fail is as important as understanding how to build them correctly.

Four failure patterns account for most misleading evaluations:

  • Goodhart's Law in prompt evals: the metric becomes the target rather than a proxy for quality. When teams optimize prompts specifically to score well on the eval metrics, the metrics stop measuring what they were designed to measure. The prompt earns high format-compliance scores but delivers low real-world utility because it was tuned to the format, not the task.
  • Dataset contamination: the eval dataset is not independent of the development process. The prompt was adjusted based on these exact inputs, so passing the eval is not evidence of generalization.
  • Evaluation metric gaming: the prompt is tuned to trigger the specific patterns the automated scorer looks for without achieving the underlying goal, producing outputs that contain the required terms without those terms being meaningful in context.
  • Evaluation coverage gaps: the eval tests what was easy to test, not what is most important to test. The result is high scores on simple cases and no coverage of the failure modes that actually occur in production.
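The metric gaming pattern is easy to reproduce with a naive scorer. The sketch below is illustrative only: `keyword_score`, the required terms, and both sample outputs are hypothetical and not taken from any particular eval framework. It shows a term-matching scorer giving the same perfect score to a useful answer and to an output that merely lists the terms.

```python
def keyword_score(output: str, required_terms: list[str]) -> float:
    """Hypothetical scorer: fraction of required terms found in the output."""
    if not required_terms:
        return 0.0
    text = output.lower()
    hits = sum(1 for term in required_terms if term.lower() in text)
    return hits / len(required_terms)


# Hypothetical success criteria for a support-reply prompt.
required = ["refund", "order number", "business days"]

# A genuinely useful reply.
useful = (
    "Your refund for order number 1042 was issued and should "
    "arrive within 5 business days."
)

# An output that contains every term but says nothing.
gamed = "Refund. Order number. Business days."

print(keyword_score(useful, required))  # 1.0
print(keyword_score(gamed, required))   # 1.0
```

Because the scorer cannot tell the two outputs apart, a prompt tuned against it can reach a perfect score without improving on the task.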

Do This

  • Build the golden dataset before finalizing the prompt
  • Monitor production quality signals alongside eval scores
  • Review prompts that score suspiciously high — real prompts have failure modes
  • Include adversarial cases that target your specific success metrics
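
One way to act on the first item is to freeze a held-out portion of the golden dataset before prompt iteration starts, and to look at it only for final checks. The sketch below is a minimal example under stated assumptions: each example is a dict with a stable `id` field, and the names `assign_split`, `split_dataset`, and `holdout_fraction` are hypothetical. Hashing the ID keeps the assignment stable across runs and across later additions to the dataset.

```python
import hashlib


def assign_split(example_id: str, holdout_fraction: float = 0.3) -> str:
    """Deterministically assign an example to 'dev' or 'holdout' by hashing its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "holdout" if bucket < holdout_fraction else "dev"


def split_dataset(examples: list[dict], holdout_fraction: float = 0.3) -> dict:
    """Group examples into a dev set (visible while iterating) and a holdout set."""
    splits = {"dev": [], "holdout": []}
    for example in examples:
        split = assign_split(example["id"], holdout_fraction)
        splits[split].append(example)
    return splits


# Hypothetical golden dataset with stable IDs.
golden = [{"id": f"case-{i:03d}", "input": f"example input {i}"} for i in range(50)]
splits = split_dataset(golden)
print(len(splits["dev"]), len(splits["holdout"]))
```

Iterate against `splits["dev"]` only. A score on `splits["holdout"]` is evidence of generalization only while that set has not influenced prompt changes.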

Avoid This

  • Iterate on the prompt with the golden dataset visible, which contaminates the dataset
  • Treat high eval scores as sufficient evidence of production quality
  • Design metrics that are easy to score instead of metrics that matter
  • Ignore the distribution of failure modes: an aggregate pass rate hides patterns
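
The last item can be made concrete with a per-category breakdown. The sketch below uses made-up results: the category names, the counts, and the `pass_rate_by_category` helper are all hypothetical. It shows how a high overall pass rate can sit on top of a category that fails half the time.

```python
from collections import defaultdict


def pass_rate_by_category(results: list[dict]) -> dict:
    """Compute the pass rate separately for each category label."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        totals[result["category"]] += 1
        passes[result["category"]] += int(result["passed"])
    return {category: passes[category] / totals[category] for category in totals}


# Hypothetical eval results: 18 easy cases and 2 hard ones.
results = (
    [{"category": "simple_lookup", "passed": True}] * 18
    + [{"category": "ambiguous_request", "passed": True}] * 1
    + [{"category": "ambiguous_request", "passed": False}] * 1
)

overall = sum(r["passed"] for r in results) / len(results)
by_category = pass_rate_by_category(results)

print(overall)      # 0.95
print(by_category)  # {'simple_lookup': 1.0, 'ambiguous_request': 0.5}
```

Here the 95% aggregate is driven almost entirely by the easy category, while the category most likely to matter in production is both underrepresented and failing.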