AT-301g · Module 1

Observability Layers

Agent observability operates at three layers, each with its own metrics, granularity, and alert thresholds.

Layer 1 — Agent Health: is each agent operational? Heartbeat checks every 60 seconds confirm that each agent is responsive. Health metrics are availability (up/down), response latency (p50/p90/p99), error rate (percentage of failed task completions), and queue depth (pending tasks). This layer is the foundation: if an agent is down or degraded, nothing else matters.
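As a minimal sketch of how the four health metrics might be computed for one agent, the Python below summarizes a window of samples. The 60-second heartbeat interval comes from the text; the function name, the `AgentHealth` fields, and the rule that an agent counts as down after two missed heartbeats are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import quantiles

HEARTBEAT_INTERVAL_S = 60      # from the text: one heartbeat check every 60 seconds
MISSED_BEATS_BEFORE_DOWN = 2   # assumption: tolerate one late heartbeat


@dataclass
class AgentHealth:
    available: bool
    p50_ms: float
    p90_ms: float
    p99_ms: float
    error_rate_pct: float
    queue_depth: int


def summarize_health(now_s, last_heartbeat_s, latencies_ms, failed, total, queue_depth):
    """Summarize one agent's health from a window of samples (hypothetical helper).

    latencies_ms needs at least two samples for the percentile calculation.
    """
    # quantiles(..., n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    silence_s = now_s - last_heartbeat_s
    available = silence_s <= HEARTBEAT_INTERVAL_S * MISSED_BEATS_BEFORE_DOWN
    error_rate_pct = 100.0 * failed / total if total else 0.0
    return AgentHealth(available, cuts[49], cuts[89], cuts[98], error_rate_pct, queue_depth)
```

For example, `summarize_health(1000, 950, latencies, failed=3, total=60, queue_depth=4)` reports the agent as available with a 5% error rate, while a last heartbeat at `800` would mark it as down under the assumed two-beat rule.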

Layer 2 — Task Performance: is each agent producing quality output at the expected rate? The metrics here are throughput (tasks completed per hour), quality scores (from the critique loops), cycle time (average time from task receipt to delivery), and rework rate (percentage of deliverables sent back for revision). This layer catches capability degradation that health monitoring misses: an agent can be "up" while producing substandard work.
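The four task-performance metrics can be derived from per-task records, as in the sketch below. The `TaskRecord` shape, the 0 to 1 quality scale, and the `task_performance` helper are hypothetical; the text does not specify how records are stored or how critique loops score work.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    received_s: float     # when the agent received the task
    delivered_s: float    # when the agent delivered the result
    quality_score: float  # assumption: critique-loop score on a 0-1 scale
    sent_back: bool       # True if the deliverable was returned for revision


def task_performance(records, window_hours):
    """Compute Layer 2 metrics for one agent over a time window."""
    n = len(records)
    if n == 0:
        return {"throughput_per_hour": 0.0, "mean_quality": None,
                "mean_cycle_time_s": None, "rework_rate_pct": 0.0}
    return {
        "throughput_per_hour": n / window_hours,
        "mean_quality": mean(r.quality_score for r in records),
        "mean_cycle_time_s": mean(r.delivered_s - r.received_s for r in records),
        "rework_rate_pct": 100.0 * sum(r.sent_back for r in records) / n,
    }
```

An agent whose heartbeat and latency look healthy can still show a falling `mean_quality` or a rising `rework_rate_pct` here, which is the degradation this layer exists to catch.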

Layer 3 — System Coordination: is the team functioning as an integrated unit? The metrics here are coordination efficiency (the master metric, currently 94.73%), handoff integrity, cross-team quality scores, and end-to-end latency from customer request to deliverable. This layer catches emergent problems that no individual agent metric reveals.
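To illustrate an emergent problem of this kind, the sketch below checks end-to-end latency against a system-level limit while also checking each agent against its own limit. The stage names, both limits, and the `find_latency_breaches` helper are invented for the example, and it assumes a simple sequential pipeline with no waiting time between handoffs.

```python
def find_latency_breaches(stage_latencies_s, per_agent_limit_s, end_to_end_limit_s):
    """Return (agents over their own limit, total latency, whether the total breaches)."""
    agent_breaches = [
        name for name, seconds in stage_latencies_s.items()
        if seconds > per_agent_limit_s
    ]
    total_s = sum(stage_latencies_s.values())
    return agent_breaches, total_s, total_s > end_to_end_limit_s


# Hypothetical request that passes through four agents in sequence.
stages = {"intake": 40, "research": 55, "draft": 50, "review": 45}
breaches, total_s, system_breach = find_latency_breaches(
    stages, per_agent_limit_s=60, end_to_end_limit_s=180
)
```

With these numbers every agent stays under its own 60-second limit, so `breaches` is empty, yet the 190-second total exceeds the 180-second end-to-end limit. Only a system-level check surfaces the problem.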