PM-301h · Module 1

Success Metrics


Success metrics are the operationalized definition of what a prompt output must achieve to count as a pass. They must be defined before the eval runs — not after you see the outputs and decide what "good" looks like. Metrics defined after the fact are not measurements; they are post-hoc rationalizations.

Five metric types cover the majority of evaluation needs.

Exact match: the output equals the expected output exactly. A high bar, appropriate for structured outputs (JSON, CSV) where format must be precise.

Format compliance: the output conforms to the required structure (valid JSON, correct field names, within word limit). Automatable, measurable, and the most common first-line metric for structured outputs.

Semantic similarity: the output captures the same meaning as the expected output, even if the wording differs. Requires an embedding model or a reference-based metric (ROUGE, BERTScore). Appropriate for summarization and paraphrase tasks.

Constraint adherence: the output satisfies a set of explicit constraints (must include X, must not include Y, must be between N and M words). Automatable with string matching and length checks.

Factual accuracy: the output's claims match the ground truth. Requires a reference document and either human review or an LLM-as-judge approach. The hardest metric to automate reliably.

  1. Exact Match. Use for: structured data extraction, classification labels, short-answer retrieval. Implementation: string equality check. Limitation: zero tolerance for paraphrase — a correct answer in different words fails. Use when format precision is non-negotiable.
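A minimal sketch of the equality check. The whitespace normalization is an assumption — some teams compare raw strings, but stripping trailing newlines avoids failing an otherwise identical answer:

```python
def exact_match(output: str, expected: str) -> bool:
    # Strip leading/trailing whitespace so a stray newline from the
    # model does not fail an otherwise identical answer.
    return output.strip() == expected.strip()

# A classification label with a trailing newline still passes:
exact_match('{"label": "spam"}\n', '{"label": "spam"}')  # True
# A paraphrase fails, by design:
exact_match("unsolicited email", "spam")  # False
```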
  2. Format Compliance. Use for: any prompt that must produce structured output (JSON, YAML, CSV, markdown with specific headers). Implementation: parse the output and validate against the schema. Automate with a JSON schema validator or a regex for simpler formats. This is the minimum bar for all structured prompts.
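A sketch of the parse-then-validate pattern using only the standard library; the field names in `REQUIRED_FIELDS` are a hypothetical schema, and a full JSON Schema validator (e.g., the `jsonschema` package) would replace the hand-rolled check in practice:

```python
import json

# Hypothetical schema: required fields and their expected Python types.
REQUIRED_FIELDS = {"title": str, "tags": list}

def format_compliant(output: str) -> bool:
    # Step 1: the output must parse as JSON at all.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Step 2: it must be an object with every required field,
    # each of the expected type.
    if not isinstance(data, dict):
        return False
    return all(
        field in data and isinstance(data[field], ftype)
        for field, ftype in REQUIRED_FIELDS.items()
    )
```

Failing the parse and failing the schema are scored the same here; splitting them into two metrics tells you whether the model's problem is syntax or structure.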
  3. Semantic Similarity. Use for: summarization, paraphrase, open-ended generation where multiple valid outputs exist. Implementation: cosine similarity between output and reference embeddings, or ROUGE-L for text overlap. Set a threshold (e.g., similarity ≥ 0.85) as the pass criterion. Requires a reference output for each test case.
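The ROUGE-L variant can be sketched in pure Python: F1 over the longest common subsequence of tokens, compared against a threshold. The 0.85 threshold below mirrors the example above and is illustrative, not a recommendation; production use typically reaches for an established package rather than this hand-rolled version:

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: F-measure over the longest common subsequence of tokens."""
    cand, ref = candidate.split(), reference.split()
    # LCS length via dynamic programming.
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, ct in enumerate(cand):
        for j, rt in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

THRESHOLD = 0.85  # illustrative pass criterion from the text

def similarity_pass(candidate: str, reference: str) -> bool:
    return rouge_l_f1(candidate, reference) >= THRESHOLD
```

Embedding-based cosine similarity follows the same shape: score the pair, compare to a pre-committed threshold, and record the raw score alongside the pass/fail so threshold changes can be re-evaluated later.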
  4. Constraint Adherence. Use for: prompts with explicit output rules (word limits, required terms, prohibited phrases). Implementation: rule-based checks — word count, substring search, regex. Highly automatable. Stack multiple constraints: all must pass for the case to pass.
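A sketch of stacked rule-based checks. The constraint set (50–100 words, must mention "warranty", must not name the competitor "AcmeCorp") is hypothetical, invented here to illustrate the pattern:

```python
import re

def check_constraints(output: str) -> dict:
    """Return each constraint's pass/fail so failures are diagnosable."""
    words = output.split()
    return {
        "word_count": 50 <= len(words) <= 100,          # length check
        "mentions_warranty": "warranty" in output.lower(),  # required term
        "no_competitor": re.search(r"\bAcmeCorp\b", output) is None,  # prohibited term
    }

def constraints_pass(output: str) -> bool:
    # Stacked constraints: every individual check must pass.
    return all(check_constraints(output).values())
```

Returning the per-constraint dict rather than a bare boolean makes failure analysis cheap: an aggregate pass rate tells you a prompt regressed, but the per-rule breakdown tells you which rule it regressed on.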
  5. Factual Accuracy. Use for: RAG outputs, fact-based Q&A, research summarization. Implementation: LLM-as-judge (preferred) or human review. LLM-as-judge: provide the reference document and ask a separate model to score the output's accuracy. Calibrate the judge against human ratings before relying on its scores.
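The judge itself is a model call, but the scaffolding around it — building the judge prompt from the reference document and parsing a machine-readable verdict — can be sketched without committing to any provider's API. The template wording and the PASS/FAIL protocol below are assumptions, not a standard:

```python
# Hypothetical judge prompt: the verdict protocol (PASS/FAIL on the
# first line) is an assumption chosen to make parsing trivial.
JUDGE_TEMPLATE = """You are grading a model output for factual accuracy.

Reference document:
{reference}

Model output:
{output}

Does every claim in the model output match the reference? Answer with a
single word, PASS or FAIL, on the first line, then a one-sentence reason.
"""

def build_judge_prompt(output: str, reference: str) -> str:
    return JUDGE_TEMPLATE.format(reference=reference, output=output)

def parse_verdict(judge_response: str) -> bool:
    # The judge is instructed to lead with PASS or FAIL, so only the
    # first line is inspected.
    first_line = judge_response.strip().splitlines()[0].strip().upper()
    return first_line.startswith("PASS")
```

Calibration then means running this judge over a set of cases that humans have already rated and checking agreement before its scores are trusted in the eval.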