OC-301f · Module 2
Automated Output Scoring
3 min read
Automated output scoring uses a second model to evaluate the agent's output against defined quality criteria. The scoring model receives the original input, the agent's output, and a rubric defining quality dimensions and scoring scales, and it returns a structured score for each dimension.
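The contract above can be sketched in TypeScript. This is an illustrative shape, not a specific library's API; the type and function names are assumptions. The key practical detail is validating the scoring model's reply, so a garbled judge response fails loudly instead of silently polluting your metrics.

```typescript
// Shape of one scored dimension, as returned by the scoring model.
interface DimensionScore {
  dimension: string;
  score: number;        // within the rubric's scale, e.g. 1-5
  justification: string; // why the judge chose this score
}

// Parse the scoring model's JSON reply and reject malformed or
// out-of-range entries rather than letting them into the metrics.
function parseScores(raw: string, minScore = 1, maxScore = 5): DimensionScore[] {
  const parsed = JSON.parse(raw) as DimensionScore[];
  for (const d of parsed) {
    if (typeof d.dimension !== 'string' || typeof d.score !== 'number') {
      throw new Error(`malformed score entry: ${JSON.stringify(d)}`);
    }
    if (!Number.isInteger(d.score) || d.score < minScore || d.score > maxScore) {
      throw new Error(`score out of range for ${d.dimension}: ${d.score}`);
    }
  }
  return parsed;
}

// Example reply from the scoring model:
const reply =
  '[{"dimension":"factual_accuracy","score":4,"justification":"No false claims; misses one edge case."}]';
const scores = parseScores(reply);
```

Keeping a `justification` field alongside each score makes disagreements auditable: when two judge runs differ, the free-text rationale usually reveals which rubric anchor was ambiguous.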
The rubric is the critical design element. A vague rubric ("rate quality 1-5") produces inconsistent scores. A specific rubric ("Rate factual accuracy 1-5 where: 1 = contains false claims, 2 = omits critical facts, 3 = factually correct but incomplete, 4 = factually correct and comprehensive, 5 = factually correct, comprehensive, and adds non-obvious insight") produces consistent, meaningful scores. Design rubrics with behavioral anchors at every level of the scale — the evaluator must be able to distinguish between a 3 and a 4 without subjective judgment.
```typescript
interface QualityRubric {
  dimensions: {
    name: string;
    weight: number; // 0-1, all weights sum to 1
    scale: {
      score: number; // 1-5
      anchor: string; // behavioral description
    }[];
  }[];
}

const analysisRubric: QualityRubric = {
  dimensions: [
    {
      name: 'factual_accuracy',
      weight: 0.3,
      scale: [
        // Anchors for scores 2 and 4 omitted here for brevity; a production
        // rubric defines a behavioral anchor at every level of the scale.
        { score: 1, anchor: 'Contains false claims' },
        { score: 3, anchor: 'Correct but incomplete' },
        { score: 5, anchor: 'Correct, comprehensive, insightful' },
      ],
    },
    {
      name: 'format_compliance',
      weight: 0.2,
      scale: [
        { score: 1, anchor: 'Does not follow format spec' },
        { score: 3, anchor: 'Follows structure, minor deviations' },
        { score: 5, anchor: 'Exact compliance with all format rules' },
      ],
    },
    // Remaining dimensions (carrying the other 0.5 of weight) omitted.
  ],
};
```
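One common way to collapse the per-dimension scores into a single number is a weighted average. The sketch below normalizes by the total weight, so a truncated rubric like the one above (whose listed weights sum to 0.5) still yields a score on the 1-5 scale; the helper name is an assumption, not part of any particular framework.

```typescript
// Minimal dimension shape, mirroring QualityRubric's dimensions.
interface WeightedDimension {
  name: string;
  weight: number; // 0-1
}

// Combine per-dimension scores into one weighted score on the rubric's scale.
// Dividing by the total weight keeps partial rubrics on the same 1-5 scale.
function weightedScore(
  dimensions: WeightedDimension[],
  scores: Record<string, number>,
): number {
  const totalWeight = dimensions.reduce((sum, d) => sum + d.weight, 0);
  if (totalWeight === 0) {
    throw new Error('rubric has no weighted dimensions');
  }
  const weightedSum = dimensions.reduce((sum, d) => {
    const score = scores[d.name];
    if (score === undefined) {
      throw new Error(`missing score for dimension: ${d.name}`);
    }
    return sum + score * d.weight;
  }, 0);
  return weightedSum / totalWeight;
}

// Example: factual accuracy scored 4, format compliance scored 5.
const overall = weightedScore(
  [
    { name: 'factual_accuracy', weight: 0.3 },
    { name: 'format_compliance', weight: 0.2 },
  ],
  { factual_accuracy: 4, format_compliance: 5 },
); // ≈ 4.4
```

Failing on a missing dimension, rather than defaulting it to zero, is a deliberate choice: a silently dropped dimension would skew the aggregate downward and mask judge errors.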