RC-401b · Module 3

Scaling & Performance

4 min read

Scaling agent operations is not the same as scaling web services. You cannot just add more instances behind a load balancer. Each agent carries state — context windows, conversation history, learned patterns, inter-agent relationships. Scaling means replicating all of that state correctly, or designing your architecture so that state does not need to be replicated.

The AT track teaches horizontal scaling patterns: shard work across independent teams rather than growing a single team. Keep teams at 3-5 agents (the coordination sweet spot) and run multiple teams in parallel instead of one team of 15. The probability math alone justifies this: if each agent succeeds with probability 0.95, a team of 5 completes everything with probability 0.95^5 ≈ 77%, while a team of 15 sits at 0.95^15 ≈ 46%. CC performance optimization reduces per-agent cost: right-size models (Sonnet for specialists, Opus for leads), keep CLAUDE.md lean, run /compact when context grows, and terminate idle agents immediately. OC performance tuning optimizes the infrastructure layer: session management, cron scheduling, and multi-model routing to minimize latency and maximize throughput.
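The all-succeed math above can be sketched in a few lines (the 0.95 per-agent success rate is the figure assumed in this module):

```python
# All-succeed probability for a team of independent agents,
# each succeeding with probability 0.95 (the module's assumed rate).
P_AGENT = 0.95

def all_succeed(team_size: int, p: float = P_AGENT) -> float:
    """Probability that every agent in the team completes its task."""
    return p ** team_size

print(f"Team of 5:  {all_succeed(5):.0%}")   # ~77%
print(f"Team of 15: {all_succeed(15):.0%}")  # ~46%
```

Note the asymmetry: tripling team size from 5 to 15 does not triple output, but it does compound the failure risk multiplicatively.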

  1. Shard, Don't Grow. When workload increases, resist the urge to add agents to an existing team. Instead, spin up a new independent team with its own lead and specialists. Teams of 3-5 agents maintain 77-86% all-succeed probability; teams of 10+ drop below 60%. Multiple small teams with a meta-orchestrator outperform a single large team on every metric: cost, reliability, latency, and output quality.
  2. Right-Size Models Per Agent. Not every agent needs the most capable model. Research agents doing structured data extraction run on Sonnet at one-third the cost of Opus with comparable results. Critic agents that evaluate against defined rubrics run on Sonnet. Only leads — agents that synthesize, judge, and make routing decisions — justify Opus. Run /cost during development to measure actual token consumption per agent role. Let the data drive your model assignment, not intuition.
  3. Optimize Context and Lifecycle. Three performance levers you control directly. First, keep CLAUDE.md under 300 lines — bloated project context inflates every agent's baseline token cost. Second, use /compact when context windows grow beyond 60% capacity — this compresses history without losing critical state. Third, implement aggressive lifecycle management: terminate agents the moment their deliverable is approved. One user left background agents running for 12 hours after a 2-hour task and burned over 10,000 unnecessary tokens. Every idle agent is wasted spend.
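The right-sizing rule in item 2 amounts to a role-to-model table with a cheap default. A minimal sketch, where the role names and model identifiers are placeholders rather than a real API:

```python
# Hypothetical role-to-model assignment illustrating right-sizing.
# Role names and model ids are placeholders, not a real SDK.
ROLE_MODELS = {
    "lead":       "opus",    # synthesizes, judges, routes: needs top capability
    "researcher": "sonnet",  # structured extraction at ~1/3 the cost of Opus
    "critic":     "sonnet",  # rubric-based evaluation
}

def model_for(role: str) -> str:
    # Default unknown roles to the cheaper model; escalate only
    # when /cost data shows the cheaper model underperforms.
    return ROLE_MODELS.get(role, "sonnet")

print(model_for("lead"))        # opus
print(model_for("summarizer"))  # sonnet (defaulted)
```

Defaulting to the cheaper model and escalating on evidence matches the "let the data drive your model assignment" rule.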
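The lifecycle rule in item 3 can be enforced mechanically with a periodic sweep. A minimal sketch, assuming a hypothetical Agent record and terminate() hook (in practice this would kill the session and free its context):

```python
# Hypothetical lifecycle sweep: terminate any agent whose deliverable
# is approved, or that has sat idle past a cutoff. Agent and
# terminate() are illustrative, not a real API.
import time
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    approved: bool = False
    last_active: float = field(default_factory=time.time)
    alive: bool = True

    def terminate(self) -> None:
        self.alive = False  # placeholder for killing the real session

def sweep(agents: list[Agent], idle_cutoff_s: float = 600.0) -> list[str]:
    """Terminate approved or idle agents; return the names terminated."""
    now = time.time()
    terminated = []
    for a in agents:
        if a.alive and (a.approved or now - a.last_active > idle_cutoff_s):
            a.terminate()
            terminated.append(a.name)
    return terminated
```

Running a sweep like this on a short cron interval is what turns "terminate idle agents immediately" from a habit into a guarantee, and would have caught the 12-hour idle agents in the example above.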