OC-301f · Module 2

Automated Output Scoring

3 min read

Automated output scoring uses a second model to evaluate the agent's output against defined quality criteria. The scoring model receives three inputs: the original input, the agent's output, and a rubric defining quality dimensions and scoring scales. It returns a structured score for each dimension.

The rubric is the critical design element. A vague rubric ("rate quality 1-5") produces inconsistent scores. A specific rubric ("Rate factual accuracy 1-5 where: 1 = contains false claims, 2 = omits critical facts, 3 = factually correct but incomplete, 4 = factually correct and comprehensive, 5 = factually correct, comprehensive, and adds non-obvious insight") produces consistent, meaningful scores. Design rubrics with behavioral anchors at every level of the scale — the evaluator must be able to distinguish between a 3 and a 4 without subjective judgment.

interface QualityRubric {
  dimensions: {
    name: string;
    weight: number;    // 0-1, all weights sum to 1
    scale: {
      score: number;   // 1-5
      anchor: string;  // behavioral description
    }[];
  }[];
}

const analysisRubric: QualityRubric = {
  dimensions: [
    {
      name: 'factual_accuracy',
      weight: 0.6,
      scale: [
        // anchors for scores 2 and 4 omitted here for brevity;
        // a production rubric anchors every level
        { score: 1, anchor: 'Contains false claims' },
        { score: 3, anchor: 'Correct but incomplete' },
        { score: 5, anchor: 'Correct, comprehensive, insightful' },
      ],
    },
    {
      name: 'format_compliance',
      weight: 0.4,
      scale: [
        { score: 1, anchor: 'Does not follow format spec' },
        { score: 3, anchor: 'Follows structure, minor deviations' },
        { score: 5, anchor: 'Exact compliance with all format rules' },
      ],
    },
  ],
};
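The pieces above can be wired together in a short sketch: the rubric is rendered into the scoring model's prompt, and the per-dimension scores the model returns are combined into a single weighted number. The function names (`renderRubricPrompt`, `aggregateScore`) and the prompt wording here are illustrative assumptions, not a fixed API, and the model call itself is elided.

```typescript
// Minimal sketch; names and prompt wording are assumptions, not a fixed API.
interface RubricDimension {
  name: string;
  weight: number; // 0-1, all weights sum to 1
  scale: { score: number; anchor: string }[];
}

// Render the rubric into the instruction block of the scoring model's prompt.
function renderRubricPrompt(dimensions: RubricDimension[]): string {
  return dimensions
    .map(
      (d) =>
        `Rate ${d.name} (weight ${d.weight}):\n` +
        d.scale.map((s) => `  ${s.score} = ${s.anchor}`).join('\n'),
    )
    .join('\n');
}

// Combine the per-dimension scores returned by the scoring model
// (e.g. parsed from its structured JSON output) into one weighted score.
function aggregateScore(
  dimensions: RubricDimension[],
  scores: Record<string, number>,
): number {
  let total = 0;
  for (const d of dimensions) {
    const s = scores[d.name];
    if (s === undefined) throw new Error(`missing score: ${d.name}`);
    total += d.weight * s;
  }
  return total;
}

// Usage with an abbreviated two-dimension rubric:
const dims: RubricDimension[] = [
  {
    name: 'factual_accuracy',
    weight: 0.6,
    scale: [{ score: 1, anchor: 'Contains false claims' }],
  },
  {
    name: 'format_compliance',
    weight: 0.4,
    scale: [{ score: 1, anchor: 'Does not follow format spec' }],
  },
];

const prompt = renderRubricPrompt(dims);
const overall = aggregateScore(dims, { factual_accuracy: 4, format_compliance: 5 });
// overall = 0.6 * 4 + 0.4 * 5 = 4.4
```

Throwing on a missing dimension (rather than defaulting to zero) is a deliberate choice: a scoring model that silently drops a dimension would otherwise deflate the aggregate and mask the parsing failure.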