AT-301h · Module 2

State Reconstruction

4 min read

Intermittent failures require state reconstruction: understanding what the system looked like at the exact moment the failure occurred. A failure that happens only on Thursdays, only when three specific agents are busy simultaneously, or only when the pipeline processes more than 12 tasks per hour cannot be debugged from current state alone.

State reconstruction draws on four data sources. Agent state logs record what each agent was doing at the failure timestamp: active tasks, queue depth, and resource utilization. Message logs record which messages were in flight, pending, or recently delivered at the failure timestamp. Context snapshots record what data each agent had in its context window at the failure point. System metrics cover throughput, latency, and error rates at 1-minute granularity around the failure window.
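One way to picture how the four sources combine is the minimal Python sketch below. It is an illustration under stated assumptions, not a prescribed implementation: the `Record` and `Snapshot` types, the `reconstruct` function, and the 5-minute default window are all hypothetical names and values chosen for this example. Agent state and context are reduced to the latest entry at or before the failure timestamp, while messages and metrics are kept for a window around it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Record:
    """One timestamped entry from any of the four sources (hypothetical schema)."""
    timestamp: datetime
    agent: str
    payload: dict


@dataclass
class Snapshot:
    """System-wide view assembled for a single failure timestamp."""
    at: datetime
    agent_state: dict = field(default_factory=dict)  # agent -> latest state at or before `at`
    messages: list = field(default_factory=list)     # message records inside the window
    context: dict = field(default_factory=dict)      # agent -> latest context snapshot
    metrics: list = field(default_factory=list)      # metric samples inside the window


def latest_per_agent(records, at):
    """Return each agent's most recent payload recorded at or before `at`."""
    latest = {}
    for rec in sorted(records, key=lambda r: r.timestamp):
        if rec.timestamp <= at:
            latest[rec.agent] = rec.payload
    return latest


def in_window(records, at, window):
    """Return records whose timestamp lies within `window` on either side of `at`."""
    return [r for r in records if abs(r.timestamp - at) <= window]


def reconstruct(at, state_logs, message_logs, context_snapshots, metrics,
                window=timedelta(minutes=5)):
    """Combine the four sources into one snapshot (window size is an assumption)."""
    return Snapshot(
        at=at,
        agent_state=latest_per_agent(state_logs, at),
        messages=in_window(message_logs, at, window),
        context=latest_per_agent(context_snapshots, at),
        metrics=in_window(metrics, at, window),
    )
```

Calling `reconstruct` with the failure timestamp and the four record lists returns a single object that can be inspected or rendered as a narrative summary. The split between "latest value" and "windowed" sources is a design choice for this sketch; a real pipeline might keep short histories for all four.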

The reconstruction produces a system-wide snapshot: "At 14:23:17, HUNTER was processing 4 leads simultaneously, CLOSER had 3 active deals in context, CIPHER was mid-analysis on a large dataset consuming 87% of available context window, and a priority conflict between BLITZ and QUILL had just been escalated." This snapshot reveals the conditions that enabled the failure. Those conditions do not exist during normal operations, which is why the failure is intermittent.
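To make "conditions that do not exist during normal operations" concrete, the sketch below compares readings from a failure snapshot against readings seen during normal operation and flags anything outside the normal range. The function name `unusual_conditions`, the reading names, and the baseline numbers are all made up for illustration; only the snapshot values (4 leads, 3 deals, 87% context) come from the example above. A simple min/max range check is assumed here, and a real analysis might use percentiles or look at combinations of conditions instead.

```python
def unusual_conditions(snapshot, baseline):
    """Return snapshot readings that fall outside the range seen in normal operation.

    snapshot: reading name -> value at the failure timestamp
    baseline: reading name -> list of values observed during normal operation
    """
    flagged = {}
    for name, value in snapshot.items():
        normal = baseline.get(name)
        if not normal:
            # No baseline at all: the reading was never observed in normal operation.
            flagged[name] = (value, None)
            continue
        low, high = min(normal), max(normal)
        if value < low or value > high:
            flagged[name] = (value, (low, high))
    return flagged


# Snapshot values taken from the example narrative; reading names are hypothetical.
failure_snapshot = {
    "HUNTER.active_leads": 4,
    "CLOSER.active_deals": 3,
    "CIPHER.context_pct": 87,
}

# Baseline values are invented purely to exercise the function.
normal_baseline = {
    "HUNTER.active_leads": [1, 2, 2],
    "CLOSER.active_deals": [2, 3, 4],
    "CIPHER.context_pct": [30, 45, 50],
}

for name, (value, normal_range) in unusual_conditions(failure_snapshot, normal_baseline).items():
    print(f"{name}: {value} (normal range: {normal_range})")
```

With these invented baselines, the HUNTER and CIPHER readings are flagged and the CLOSER reading is not, which narrows the investigation to the conditions that were actually out of the ordinary at the failure timestamp.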