OC-301i · Module 1

Latency Analysis & Reduction

3 min read

Latency in agent systems has two components: processing latency (how long the agent takes to produce output) and queue latency (how long the task waits before an agent processes it). Processing latency is bounded by the LLM's response time — you cannot make the model think faster. Queue latency is bounded by your infrastructure — you can add more agents, increase concurrency, or prioritize the queue.
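The split above can be made concrete by timestamping a task at three points: when it enters the queue, when an agent picks it up, and when output is produced. A minimal sketch (the `TaskTimings` class and its field names are illustrative, not a real API):

```python
import time
from dataclasses import dataclass

@dataclass
class TaskTimings:
    """Hypothetical timestamps recorded as a task moves through the system."""
    enqueued_at: float   # task entered the queue
    started_at: float    # an agent picked it up
    finished_at: float   # the agent produced output

    @property
    def queue_latency(self) -> float:
        """Time spent waiting for an agent (bounded by infrastructure)."""
        return self.started_at - self.enqueued_at

    @property
    def processing_latency(self) -> float:
        """Time the agent spent producing output (bounded by the LLM)."""
        return self.finished_at - self.started_at

    @property
    def total_latency(self) -> float:
        return self.finished_at - self.enqueued_at

t = TaskTimings(enqueued_at=100.0, started_at=104.5, finished_at=112.5)
print(t.queue_latency)       # 4.5 — reducible by adding agents or rerouting
print(t.processing_latency)  # 8.0 — reducible only by smaller context, caching, etc.
```

Tracking the two components separately tells you which class of fix applies: a high queue number points at infrastructure, a high processing number points at the prompt and the model.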

Latency reduction strategies, in order of impact:

  • Reduce context size (smaller prompts produce faster responses; strip irrelevant context from the prompt)
  • Parallelize independent subtasks (if a task has three independent research steps, run them simultaneously instead of sequentially)
  • Implement response caching (if the same query appears frequently, serve the cached response instead of calling the API again)
  • Optimize queue routing (route tasks to the least-loaded agent instead of the default agent)

Each strategy has a different effort-to-impact ratio. For most systems, context reduction delivers the highest impact per unit of effort.
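Parallelization is the strategy that most often gets implemented sequentially by accident. A minimal sketch using a thread pool, assuming the subtasks are independent I/O-bound API calls (`research_step` and the sleep duration are stand-ins, not a real agent API):

```python
import concurrent.futures
import time

def research_step(topic: str) -> str:
    """Stand-in for an I/O-bound LLM/API call."""
    time.sleep(0.1)  # simulates network + model latency
    return f"findings on {topic}"

topics = ["pricing", "competitors", "regulation"]

# Sequential: total time is roughly the SUM of the step times.
start = time.perf_counter()
sequential = [research_step(t) for t in topics]
seq_elapsed = time.perf_counter() - start

# Parallel: total time is roughly the SLOWEST single step.
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as pool:
    parallel = list(pool.map(research_step, topics))
par_elapsed = time.perf_counter() - start

print(f"sequential: {seq_elapsed:.2f}s, parallel: {par_elapsed:.2f}s")
```

Threads work here because the steps spend their time waiting on the network, not computing; for an async codebase the same shape is `asyncio.gather` over coroutines.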

Do This

  • Reduce context size first — smaller prompts produce faster responses and cost less per call
  • Parallelize independent subtasks — three parallel 10-second tasks complete in 10 seconds, not 30
  • Cache responses for repeated queries — cache hits are milliseconds, API calls are seconds
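The caching bullet above can be sketched with the standard-library memoizer, assuming exact-string query repeats are common enough to matter (the `answer` function and its sleep are illustrative, not a real client):

```python
import functools
import time

@functools.lru_cache(maxsize=1024)
def answer(query: str) -> str:
    """Stand-in for an expensive LLM call, cached by exact query string."""
    time.sleep(0.1)  # simulates seconds-scale API latency
    return f"response to {query}"

start = time.perf_counter()
answer("what is our refund policy?")   # miss: pays the full call cost
miss_elapsed = time.perf_counter() - start

start = time.perf_counter()
answer("what is our refund policy?")   # hit: served from memory
hit_elapsed = time.perf_counter() - start

print(answer.cache_info())  # hits=1, misses=1
```

`cache_info()` exposes the hit/miss counts, which is exactly the number to watch: as the Avoid This section notes, a hit rate near zero means the cache is pure storage cost.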

Avoid This

  • Optimize prompt formatting when the bottleneck is API response time — formatting is microseconds
  • Add more agents when the bottleneck is API rate limits — more agents hitting the same rate limit changes nothing
  • Cache responses for unique queries — cache miss rate near 100% means the cache is storage cost without latency benefit