OC-301f · Module 2
Human Evaluation Loops
3 min read
Automated scoring catches structural failures — wrong format, missing information, persona drift. Human evaluation catches quality failures that require domain expertise — whether the analysis is insightful (not just factually correct), whether the recommendation is actionable (not just logically sound), and whether the output would pass muster with the intended audience. Both are necessary. Neither is sufficient alone.
The human evaluation loop: select a random 10% sample of agent outputs weekly. Route each sampled output to a domain-expert reviewer with a structured evaluation form. The form asks: "Would you send this output to the client/stakeholder as-is? If not, what would you change?" The reviewer's corrections become training data for improving the agent's prompts, context, and quality criteria. Over time, the correction rate should decline — the agent learns from human feedback, and the automated scoring rubric is refined to catch the patterns humans flag.
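The weekly sampling step can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `agent` and `task_type` keys are hypothetical field names, and the floor of one output per stratum is one simple way to keep low-volume agents represented.

```python
import random
from collections import defaultdict

def sample_for_review(outputs, rate=0.10, seed=None):
    """Draw a stratified random sample of agent outputs for human review.

    `outputs` is a list of dicts with hypothetical keys "agent" and
    "task_type". Roughly `rate` of each (agent, task_type) stratum is
    sampled, with at least one output per stratum so the sample is not
    dominated by the highest-volume agent.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for out in outputs:
        strata[(out["agent"], out["task_type"])].append(out)

    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * rate))  # floor of one per stratum
        sample.extend(rng.sample(group, k))
    return sample
```

Stratifying before sampling is what makes the review set representative: a plain 10% draw over all outputs would almost never include the rare agents and task types.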
- **1. Sample 10% Weekly.** Random selection, stratified by agent and task type. The sample must be representative — do not only review outputs from the highest-volume agent.
- **2. Structured Evaluation.** Reviewers use a form with specific questions: factual accuracy, actionability, tone appropriateness, and "would you send this as-is?" The form produces data, not opinions.
- **3. Feed Back Corrections.** Every correction becomes a prompt refinement or a new behavioral test case. The human evaluation loop is not just quality assurance — it is the training loop for continuous improvement.
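"The form produces data, not opinions" is easiest to enforce with a fixed schema. A minimal sketch, assuming a form with the four questions above; the field names are illustrative, not a required schema:

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    """One reviewer's answers for one agent output (hypothetical schema)."""
    output_id: str
    factually_accurate: bool
    actionable: bool
    tone_appropriate: bool
    send_as_is: bool        # "would you send this as-is?"
    correction: str = ""    # reviewer's rewrite when send_as_is is False

def correction_rate(evals):
    """Fraction of reviewed outputs the reviewer would NOT send as-is.

    This is the number that should decline week over week as corrections
    are fed back into prompts and the scoring rubric.
    """
    if not evals:
        return 0.0
    return sum(not e.send_as_is for e in evals) / len(evals)
```

Because every answer is a typed field rather than free text, the weekly correction rate can be computed and trended automatically.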
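Step 3, turning corrections into behavioral test cases, can also be sketched. This assumes review results are available as plain records with hypothetical keys `output_id`, `send_as_is`, and `correction`; each failed review becomes an expected-output case to replay against the agent after a prompt refinement:

```python
def corrections_to_test_cases(reviews):
    """Convert failed reviews into behavioral test cases.

    `reviews` is a list of dicts (hypothetical keys: "output_id",
    "send_as_is", "correction"). Each correction becomes a case pairing
    the original output with the reviewer's expected text, so prompt
    changes can be checked against the exact failures humans flagged.
    """
    return [
        {"output_id": r["output_id"], "expected": r["correction"]}
        for r in reviews
        if not r["send_as_is"] and r.get("correction")
    ]
```

Accumulating these cases week over week is what closes the loop: the agent is re-tested against every pattern a human has ever flagged, not only against the original rubric.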