OC-201c · Module 2

Performance Monitoring

3 min read

Performance degradation is slow. Your daily briefing skill used to complete in 8 seconds. This week it takes 22 seconds. Nobody noticed because it still works — the message arrives, the data is correct, the user is happy. But the latency tripled. The weather API started rate-limiting you because the provider shrank its free-tier quota. The calendar API introduced pagination that your code handles, but it doubles the request count. The AI model's response time increased because the provider is under load. No single change broke anything. But the system is slower, more expensive, and closer to a timeout failure than it was a month ago.

Performance monitoring tracks three metrics per module: latency (how long the module takes to execute), cost (how much its API calls cost), and reliability (what percentage of executions complete without errors). Store these metrics as time-series data — one data point per execution. Plot trends weekly. When latency doubles or reliability drops below 95%, investigate before the gradual degradation becomes an outage.
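One lightweight way to capture a data point per execution is a decorator that times each run and appends a record to a log file. This is a minimal sketch, not a prescribed implementation: the `metrics.jsonl` path, the `monitored` decorator name, and the flat per-call cost are all illustrative assumptions.

```python
import json
import time
from datetime import datetime, timezone
from functools import wraps

METRICS_FILE = "metrics.jsonl"  # hypothetical path; one JSON record per line

def monitored(module_name, cost_per_call=0.0):
    """Record latency, cost, and success/failure for every run of a module."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            ok = True
            try:
                return fn(*args, **kwargs)
            except Exception:
                ok = False
                raise  # record the failure, then let the error propagate
            finally:
                record = {
                    "module": module_name,
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                    "latency_s": round(time.perf_counter() - start, 4),
                    "cost_usd": cost_per_call,  # assumed flat cost per call
                    "success": ok,
                }
                with open(METRICS_FILE, "a") as f:
                    f.write(json.dumps(record) + "\n")
        return wrapper
    return decorator

@monitored("daily_briefing", cost_per_call=0.002)
def daily_briefing():
    time.sleep(0.01)  # stand-in for the weather, calendar, and AI calls
    return "briefing sent"
```

Because the record is written in a `finally` block, failures are logged too — which is what makes the reliability percentage meaningful.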

Do This

  • Record execution time, API cost, and success/failure for every module run
  • Store metrics as time-series data and plot weekly trends
  • Set alert thresholds: latency > 2x baseline, reliability < 95%, cost > 2x average
  • Investigate gradual degradation before it becomes a sudden failure
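The alert thresholds above can be sketched as a comparison between a baseline window and a recent window of the per-execution records. The `check_alerts` function name and the dict record shape are assumptions for illustration; medians are used for latency to resist outliers.

```python
import statistics

def check_alerts(baseline, recent, min_reliability=0.95):
    """Compare a recent window of metric records against a baseline window.

    Each record is a dict with latency_s, cost_usd, and success keys,
    one per module execution. Returns a list of alert strings.
    """
    alerts = []

    # Latency: alert when the recent median exceeds 2x the baseline median.
    base_latency = statistics.median(r["latency_s"] for r in baseline)
    recent_latency = statistics.median(r["latency_s"] for r in recent)
    if recent_latency > 2 * base_latency:
        alerts.append(f"latency {recent_latency:.1f}s > 2x baseline {base_latency:.1f}s")

    # Reliability: alert when the recent success rate drops below 95%.
    reliability = sum(r["success"] for r in recent) / len(recent)
    if reliability < min_reliability:
        alerts.append(f"reliability {reliability:.0%} below {min_reliability:.0%}")

    # Cost: alert when the recent average exceeds 2x the baseline average.
    base_cost = statistics.mean(r["cost_usd"] for r in baseline)
    recent_cost = statistics.mean(r["cost_usd"] for r in recent)
    if recent_cost > 2 * base_cost:
        alerts.append(f"cost ${recent_cost:.4f} > 2x average ${base_cost:.4f}")

    return alerts
```

Run weekly against the stored time series: an empty list means the module is within its thresholds; anything else is a signal to investigate before the drift becomes an outage.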

Avoid This

  • Measure performance only when something feels slow
  • Store only the most recent metric — you need the trend, not the snapshot
  • Ignore cost metrics because "it is only a few dollars" — small leaks compound
  • Wait for a timeout failure to discover that latency has been climbing for weeks