PM-301h · Module 3
Automated Evaluation Pipelines
4 min read
Manual evaluation does not scale. A team with 50 active prompts and a monthly release cadence cannot hand-score every eval run for every prompt version. Automated evaluation pipelines run tests, score outputs, and report results without human intervention — flagging only the cases that require human judgment.
The most powerful automated evaluation pattern is LLM-as-judge: using a second model to evaluate the outputs of the first. The evaluator model receives the input, the output, and a rubric, and returns a structured judgment: pass/fail, score on each criterion, and the reasoning behind any failure. This enables semantic and quality evaluation at scale — catching outputs that technically comply with format constraints but are low-quality, off-tone, or factually incomplete.
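Because the pipeline consumes judge output programmatically, the judge's JSON must be validated before its scores are trusted. The sketch below, a minimal illustration and not a full implementation, parses a judge response and enforces the schema conventions described in this module (the field names mirror the schema shown later; `parse_judgment` is a hypothetical helper name):

```python
import json

def parse_judgment(raw: str) -> dict:
    """Parse an LLM-as-judge response and enforce the expected schema.

    Raises ValueError on any schema violation so that malformed judge
    output is surfaced instead of silently counted as a score.
    """
    judgment = json.loads(raw)

    # overall_pass must be a real boolean, not a truthy string.
    if not isinstance(judgment.get("overall_pass"), bool):
        raise ValueError("overall_pass must be a boolean")

    # score must be a number in [0.0, 1.0].
    score = judgment.get("score")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        raise ValueError("score must be a number between 0.0 and 1.0")

    # Every failed criterion must carry a reason.
    for result in judgment.get("criteria_results", []):
        if result.get("passed") is False and not result.get("reason"):
            raise ValueError("failed criteria must include a reason")

    # A failing judgment must include a failure_summary.
    if judgment["overall_pass"] is False and not judgment.get("failure_summary"):
        raise ValueError("failed judgments must include a failure_summary")

    return judgment
```

Rejecting malformed judgments outright, rather than defaulting missing fields, keeps a broken evaluator from quietly polluting the score history.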
The critical design decision in LLM-as-judge is the evaluator prompt. It must be more precise than the production prompt it evaluates. Vague evaluator prompts produce inconsistent judgments, which is worse than no automated evaluation — it gives the appearance of rigor without the substance.
```
You are a rigorous evaluator for AI-generated sales emails.

You will be given:
- INPUT: The variables provided to the email generation prompt
- OUTPUT: The email generated by the prompt
- CRITERIA: The specific criteria the email must meet

Evaluate the OUTPUT against each criterion. Return a JSON object with this exact schema:

{
  "overall_pass": boolean,
  "score": number,            // 0.0 to 1.0
  "criteria_results": [
    {
      "criterion": string,
      "passed": boolean,
      "reason": string        // Required when passed=false. One sentence maximum.
    }
  ],
  "failure_summary": string | null  // Required when overall_pass=false
}

CRITERIA:
1. Word count is 120 or fewer words. Count every word including greeting and signature.
2. The recipient's company name appears at least once.
3. The email ends with a specific call to action (a question or a proposed next step).
4. The email does not use any of these phrases: "I hope this email finds you well",
   "synergy", "leverage" (as a verb), "circle back", "touch base".
5. The tone is direct and professional — not overly casual, not stiff.

INPUT:
{{input}}

OUTPUT:
{{output}}

Return only the JSON object. Do not add commentary outside the JSON.
```
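Before each run, the pipeline substitutes the case under test into the `{{input}}` and `{{output}}` slots. A minimal sketch, assuming a simple string-replacement templating scheme (the `render_evaluator_prompt` helper and the abbreviated template are illustrative, not part of any specific library):

```python
import json

# Abbreviated stand-in for the full evaluator prompt shown above.
EVALUATOR_TEMPLATE = """Evaluate the OUTPUT against each criterion.

INPUT:
{{input}}

OUTPUT:
{{output}}

Return only the JSON object."""

def render_evaluator_prompt(template: str, variables: dict, output: str) -> str:
    """Fill the template's {{input}} and {{output}} placeholders.

    The input variables are serialized as JSON so the judge sees exactly
    what the production prompt received.
    """
    prompt = template.replace("{{input}}", json.dumps(variables, indent=2))
    return prompt.replace("{{output}}", output)
```

The rendered prompt is then sent to the evaluator model, and its response is parsed as the structured judgment.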
1. Build the Evaluator Prompt: Write a precise rubric with objective, scoreable criteria. Each criterion must have a clear pass/fail definition. Ambiguous criteria produce inconsistent evaluator judgments.
2. Calibrate Against Human Judgments: Before deploying the automated evaluator, run it on 50 cases that have been previously human-scored. Calculate the agreement rate. If the evaluator agrees with humans on fewer than 85% of cases, the evaluator prompt needs revision.
3. Run at Full Scale: Once calibrated, run the automated evaluator on the full golden dataset. Human review is triggered only for cases where the evaluator's confidence is low (as indicated in the scoring rationale) or where the overall score is in the pass/fail boundary zone.
4. Monitor Evaluator Drift: Re-calibrate the evaluator quarterly against fresh human judgments. Model updates can shift evaluator behavior without notice. A miscalibrated evaluator is worse than no evaluator — it reports wrong scores with false confidence.
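The calibration gate in step 2 and the boundary-zone triage in step 3 can be sketched as follows. The 85% threshold comes from this module; the 0.7 pass cutoff and 0.1 boundary width are illustrative assumptions, not prescribed values:

```python
def agreement_rate(evaluator_verdicts: list, human_verdicts: list) -> float:
    """Fraction of cases where the evaluator's pass/fail matches the human's."""
    if len(evaluator_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be the same length")
    matches = sum(e == h for e, h in zip(evaluator_verdicts, human_verdicts))
    return matches / len(evaluator_verdicts)

def needs_human_review(score: float, pass_threshold: float = 0.7,
                       boundary: float = 0.1) -> bool:
    """Flag cases whose score falls in the zone around the pass/fail cutoff.

    pass_threshold and boundary are hypothetical defaults; tune them
    against your own score distribution.
    """
    return abs(score - pass_threshold) <= boundary

# Deploy the evaluator only if it clears the calibration gate.
rate = agreement_rate([True, True, False, True], [True, False, False, True])
calibrated = rate >= 0.85
```

With the sample verdicts above, the evaluator agrees on 3 of 4 cases (75%), so it fails the gate and the evaluator prompt would need revision before full-scale use.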