RC-401b · Module 3
Scaling & Performance
4 min read
Scaling agent operations is not the same as scaling web services. You cannot just add more instances behind a load balancer. Each agent carries state — context windows, conversation history, learned patterns, inter-agent relationships. Scaling means replicating all of that state correctly, or designing your architecture so that state does not need to be replicated.
The AT track teaches horizontal scaling patterns: shard work across independent teams rather than growing a single team. Keep teams at 3-5 agents (the coordination sweet spot) and run multiple teams in parallel rather than one team of 15. The probability math alone justifies this: three teams of 5, each with a 0.95^5 ≈ 77% all-succeed probability, dramatically outperform one team of 15 at 0.95^15 ≈ 46%.

CC performance optimization reduces per-agent cost: right-size models (Sonnet for specialists, Opus for leads), keep CLAUDE.md lean, run /compact when context grows, and terminate idle agents immediately.

OC performance tuning optimizes the infrastructure layer: session management, cron scheduling, and multi-model routing to minimize latency and maximize throughput.
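The probability argument above is worth making concrete. A minimal sketch, assuming each agent succeeds independently with the same probability (the function name is illustrative, not part of any track's tooling):

```python
def all_succeed(p_agent: float, n_agents: int) -> float:
    """Probability that every agent in a team of n succeeds,
    assuming independent successes at probability p_agent each."""
    return p_agent ** n_agents

# At 95% per-agent reliability:
team_of_5 = all_succeed(0.95, 5)    # ≈ 0.77 — each small team usually succeeds
team_of_15 = all_succeed(0.95, 15)  # ≈ 0.46 — the large team fails most runs
```

Note that the comparison is per team: each team of 5 clears 77%, while the single team of 15 needs all fifteen agents to succeed at once, which is why sharding wins even before cost and latency enter the picture.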
- Shard, Don't Grow: When workload increases, resist the urge to add agents to an existing team. Instead, spin up a new independent team with its own lead and specialists. Teams of 3-5 agents maintain 77-86% all-succeed probability. Teams of 10+ drop below 60%. Multiple small teams with a meta-orchestrator outperform a single large team on every metric: cost, reliability, latency, and output quality.
- Right-Size Models Per Agent: Not every agent needs the most capable model. Research agents doing structured data extraction run on Sonnet at one-third the cost of Opus with comparable results. Critic agents that evaluate against defined rubrics run on Sonnet. Only leads — agents that synthesize, judge, and make routing decisions — justify Opus. Run /cost during development to measure actual token consumption per agent role. Let the data drive your model assignment, not intuition.
- Optimize Context and Lifecycle: Three performance levers you control directly. First, keep CLAUDE.md under 300 lines — bloated project context inflates every agent's baseline token cost. Second, use /compact when context windows grow beyond 60% capacity — this compresses history without losing critical state. Third, implement aggressive lifecycle management: terminate agents the moment their deliverable is approved. One user left background agents running for 12 hours after a 2-hour task and burned over 10,000 unnecessary tokens. Every idle agent is wasted spend.
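The right-sizing rule — Sonnet for specialists, Opus for leads — can be expressed as a simple role-to-model table. A sketch only: the role names and relative cost units are illustrative assumptions (the module states Sonnet runs at roughly one-third the cost of Opus), not published pricing or a real API:

```python
# Hypothetical role -> model assignment, per the module's guidance.
MODEL_FOR_ROLE = {
    "lead": "opus",          # synthesizes, judges, makes routing decisions
    "researcher": "sonnet",  # structured data extraction
    "critic": "sonnet",      # evaluates against defined rubrics
}

# Illustrative relative cost units: Sonnet at ~1/3 the cost of Opus.
RELATIVE_COST = {"opus": 3.0, "sonnet": 1.0}

def team_cost(roles: list[str]) -> float:
    """Relative cost of one team, given its per-role model assignments."""
    return sum(RELATIVE_COST[MODEL_FOR_ROLE[role]] for role in roles)

# A 4-agent team (1 lead, 2 researchers, 1 critic) costs 6.0 units,
# versus 12.0 if every agent ran on Opus.
print(team_cost(["lead", "researcher", "researcher", "critic"]))  # → 6.0
```

In practice the table's entries should come from measured /cost data per role, not a hardcoded dict — the point is that the assignment is explicit and auditable rather than intuition.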
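The lifecycle rules — compact past 60% context capacity, terminate the moment a deliverable is approved — reduce to a small decision function. This is a minimal sketch under stated assumptions: the `Agent` dataclass and `tick` function are hypothetical, not a CC or OC API:

```python
from dataclasses import dataclass

COMPACT_THRESHOLD = 0.60  # compact once context passes 60% capacity

@dataclass
class Agent:
    """Hypothetical agent record; not a real CC/OC type."""
    name: str
    context_used: int          # tokens currently in the context window
    context_limit: int         # window capacity in tokens
    deliverable_approved: bool = False

def tick(agent: Agent) -> str:
    """Return the next lifecycle action for an agent."""
    if agent.deliverable_approved:
        return "terminate"     # every idle agent is wasted spend
    if agent.context_used / agent.context_limit > COMPACT_THRESHOLD:
        return "compact"       # compress history, keep critical state
    return "continue"
```

Checking approval before context pressure is deliberate: there is no reason to compact an agent that is about to be torn down.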