AT-201c · Module 1

Per-Agent Performance Metrics

3 min read

Team metrics tell you whether the system is healthy. Per-agent metrics tell you which agent is the source of the problem. Every agent in the team gets its own scorecard with four numbers: task completion rate (what percentage of dispatched tasks does this agent complete successfully), average execution time (how long does this agent take per task), first-pass quality (what percentage of this agent's output passes the review gate without revision), and token efficiency (average tokens consumed per completed task).
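As a rough sketch, the four scorecard numbers can be computed from trace-log records like this. The `TaskTrace` fields here are assumptions about what a trace log contains, not a prescribed schema:

```python
from dataclasses import dataclass

# Hypothetical trace-log record; field names are illustrative assumptions.
@dataclass
class TaskTrace:
    agent: str          # which agent handled the task
    completed: bool     # did the task finish successfully
    seconds: float      # wall-clock execution time
    passed_review: bool # passed the review gate without revision
    tokens: int         # tokens consumed

def scorecard(traces: list[TaskTrace]) -> dict:
    """Compute one agent's four metrics from its task traces."""
    done = [t for t in traces if t.completed]
    return {
        "completion_rate": len(done) / len(traces),
        "avg_seconds": sum(t.seconds for t in done) / len(done),
        "first_pass_quality": sum(t.passed_review for t in done) / len(done),
        "tokens_per_task": sum(t.tokens for t in done) / len(done),
    }
```

Run this once per agent per day over that agent's traces to keep the scorecard current.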

Per-agent metrics reveal patterns that team-level metrics hide. If the team's overall quality is 85% but FORGE is at 98% and HUNTER is at 72%, the team average masks a role-specific problem. HUNTER's lead qualification criteria may be too loose, or the prompt may need tighter output constraints. Without per-agent metrics, you see the 85% and conclude things are acceptable. With per-agent metrics, you see the 72% and fix it — which lifts the team average to 89% for free.
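The arithmetic in that example checks out under a simplifying assumption of two agents with equal task volume (an unweighted average):

```python
# First-pass quality per agent, equal task volumes assumed.
quality = {"FORGE": 0.98, "HUNTER": 0.72}
team_avg = sum(quality.values()) / len(quality)   # 0.85 masks the gap

# Tighten HUNTER's qualification criteria; suppose it rises to 80%.
quality["HUNTER"] = 0.80
new_avg = sum(quality.values()) / len(quality)    # 0.89
```

With unequal task volumes you would weight each agent's quality by its task count, but the masking effect is the same.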

I rank agents weekly by each metric and share the rankings with the team. Not as competition — agents do not have egos. As diagnostic data. The agents with the lowest first-pass quality are the ones whose role definitions, prompt templates, or input contracts need review. The rankings point me to the highest-leverage optimization opportunities.
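The weekly ranking is a one-liner once the scorecards exist. The agent names and numbers below are hypothetical:

```python
# Hypothetical weekly first-pass quality per agent.
fpq = {"FORGE": 0.98, "SCOPE": 0.91, "HUNTER": 0.72, "ATLAS": 0.88}

# Sort ascending so the lowest-quality agents surface first;
# these are the role definitions and prompts to review this week.
ranking = sorted(fpq, key=fpq.get)
bottom_three = ranking[:3]
```

The same pattern works for any of the four metrics; only the sort key changes.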

1. Define the Scorecard. Four metrics per agent: task completion rate, average execution time, first-pass quality rate, and tokens per task. Calculate each from the trace logs. Update daily.
2. Set Baselines. After 2 weeks of data, establish baseline ranges for each agent. FORGE may average 45 seconds per task; SCOPE may average 120 seconds. Different roles have different baselines. Compare each agent to its own baseline, not to other agents.
3. Flag Deviations. When an agent's metric moves more than 15% from its baseline, flag it for investigation. A sudden increase in execution time may indicate an API slowdown. A drop in first-pass quality may indicate a prompt regression.
4. Review and Optimize Weekly. Each week, review the bottom 3 agents by first-pass quality. Examine their recent failures. Is the role definition unclear? Is the prompt missing constraints? Is the input contract being violated by the upstream agent? Fix the root cause, not the symptom.
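The deviation check in step 3 can be sketched as a relative-change test against the agent's own baseline (the 45-second FORGE baseline comes from step 2; the 60-second reading is a made-up example):

```python
def deviates(value: float, baseline: float, threshold: float = 0.15) -> bool:
    """True when a metric has moved more than `threshold` (15%) from baseline."""
    return abs(value - baseline) / baseline > threshold

# FORGE's baseline is 45 s/task. A week averaging 60 s is a 33% jump:
# flag it and look for an API slowdown or a prompt regression.
print(deviates(60, 45))  # 33% over baseline: flag for investigation
print(deviates(48, 45))  # ~7% drift: within the normal range
```

Note that the check is symmetric: a metric dropping 15% below baseline (say, tokens per task suddenly halving) is also worth investigating, since it can signal truncated output rather than an efficiency win.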