PM-301h · Module 2

Unit Testing Prompts

4 min read

Unit testing a prompt means running each test case in the golden dataset individually, scoring the output against the defined success criteria, and recording a pass or fail. The test is deterministic: same input, same scoring criteria, same scorer. The only variables are the prompt and the model.

Pass/fail criteria must be defined before the unit test runs. "The output of case-001 passes if: word count ≤ 120 AND contains 'Meridian Analytics' AND does not contain 'I hope this email finds you well' AND ends with a question or call to action." That is a deterministic criterion. A human reviewer can apply it. A script can apply it. The result will be the same.
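Criteria like these translate directly into code. A minimal sketch of a scorer for case-001 (the function name and the call-to-action heuristic are illustrative, not part of the module):

```python
import re

def passes_case_001(output: str) -> bool:
    """Deterministic pass/fail check for case-001's stated criteria."""
    word_count_ok = len(output.split()) <= 120
    has_company = "Meridian Analytics" in output
    no_cliche = "I hope this email finds you well" not in output
    # "Ends with a question or call to action": a question mark, or a
    # closing sentence containing a CTA verb (heuristic, tune per use case)
    text = output.rstrip()
    ends_well = text.endswith("?") or bool(
        re.search(r"(call|reply|schedule|book|let me know)[^.]*[.!]$",
                  text, re.IGNORECASE)
    )
    return word_count_ok and has_company and no_cliche and ends_well
```

A human applying the written criterion and this function will agree on every output, which is exactly what makes the metric automatable.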

The decision about automated vs. manual review should be made at the metric level, not the prompt level. Format compliance, constraint adherence, and exact match are always automatable — write a scorer function. Semantic similarity is automatable with an embedding model. Factual accuracy for complex claims usually requires LLM-as-judge for scale or human review for calibration. Do not hand all unit test scoring to humans — it does not scale and introduces inter-rater variability that degrades eval reliability.
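One way to make that metric-level decision concrete is a registry that maps each automatable metric to a deterministic scorer, and refuses metrics that need a judge or a human. This is a sketch; the registry name and metric keys are hypothetical:

```python
import json

def _is_json(text: str) -> bool:
    """Format compliance: does the output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# Hypothetical registry: metric name -> deterministic scorer function
SCORERS = {
    "json_format": lambda output, expected: _is_json(output),          # format compliance
    "max_words_120": lambda output, expected: len(output.split()) <= 120,  # constraint adherence
    "exact_match": lambda output, expected: output.strip() == expected.strip(),
}

def score(metric: str, output: str, expected: str = "") -> bool:
    """Apply the registered scorer; unknown metrics must be routed elsewhere."""
    scorer = SCORERS.get(metric)
    if scorer is None:
        raise ValueError(
            f"No automated scorer for {metric!r}; route to LLM-as-judge or human review"
        )
    return scorer(output, expected)
```

Metrics like semantic similarity or factual accuracy deliberately have no entry here: attempting to score them raises an error instead of silently returning a wrong answer.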

from collections import defaultdict
from datetime import datetime, timezone

def run_unit_test(prompt, dataset, model):
    results = []

    for case in dataset.cases:
        # Run the prompt with the test input
        output = model.complete(
            prompt=prompt.template.format(**case.input),
            temperature=0,          # deterministic output for testing
            max_tokens=prompt.max_tokens,
        )

        # Score the output against pass criteria;
        # evaluate() applies case.pass_criteria and is defined elsewhere
        score = evaluate(output, case.expected, case.pass_criteria)

        results.append({
            "case_id": case.id,
            "category": case.category,
            "passed": score.passed,
            "score": score.value,
            "failure_reason": score.failure_reason if not score.passed else None,
            "output_preview": output[:200],
        })

    # Aggregate overall and per-category pass rates
    total = len(results)
    passed = sum(1 for r in results if r["passed"])

    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r)

    return {
        "prompt_id": prompt.id,
        "prompt_version": prompt.version,
        "dataset_version": dataset.metadata.dataset_version,
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "pass_rate": passed / total,
        "passed": passed,
        "total": total,
        "by_category": {
            cat: {
                "pass_rate": sum(1 for r in cases if r["passed"]) / len(cases),
                "count": len(cases),
            }
            for cat, cases in by_category.items()
        },
        "failures": [r for r in results if not r["passed"]],
    }