AT-301h · Module 2

Trace Propagation

Every message in the system carries a correlationId. That ID is the thread you pull to unravel a multi-agent failure. Trace propagation follows the correlationId backward from the failure point to the originating request, examining every message in the chain.
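The backward walk can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual message schema: the `Message` fields (`message_id`, `parent_id`, `sender`, `receiver`) and the `trace_backward` helper are assumptions made for the example. Only the correlationId itself comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    # Hypothetical message shape; only the correlation ID is given by the module text.
    message_id: str
    correlation_id: str
    parent_id: Optional[str]  # message that triggered this one; None at the origin
    sender: str
    receiver: str


def trace_backward(log, failed_message_id):
    """Return the chain from the failure point back to the originating request."""
    by_id = {m.message_id: m for m in log}
    failed = by_id[failed_message_id]
    chain = []
    seen = set()  # guards against a cyclic parent reference in a corrupted log
    current = failed
    while current is not None and current.message_id not in seen:
        if current.correlation_id != failed.correlation_id:
            break  # left the trace we are following
        seen.add(current.message_id)
        chain.append(current)
        if current.parent_id is None:
            current = None
        else:
            current = by_id.get(current.parent_id)
    return chain


# Example log: one three-hop trace plus an unrelated message.
log = [
    Message("m1", "c-42", None, "client", "agent_a"),
    Message("m2", "c-42", "m1", "agent_a", "agent_b"),
    Message("m3", "c-42", "m2", "agent_b", "agent_c"),
    Message("x1", "c-99", None, "client", "agent_a"),
]
chain = trace_backward(log, "m3")
chain_ids = [m.message_id for m in chain]
```

Here `chain_ids` lists the failing message first and the originating request last, which is the order in which you would examine the hops.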

The trace produces a timeline: message sent by Agent A at T0, received by Agent B at T0+200ms, processed and forwarded to Agent C at T0+4.2s, received by Agent C at T0+4.4s, processed and output delivered at T0+47.3s. At each hop, ask three questions. Did the payload change between send and receive (transmission integrity)? Did the receiving agent transform the data correctly (processing integrity)? Did the processing time fall within the expected range (latency integrity)?
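The three per-hop checks can be expressed as one small function. The sketch below reuses the timestamps from the timeline above, converted to milliseconds after T0. Everything else is an assumption for illustration: the `Hop` record, the payload contents, the `keeps_order_id` validator, and the 10-second processing budget are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Hop:
    # Hypothetical per-hop record assembled from the trace.
    agent: str
    sent_payload: dict      # payload as the upstream sender emitted it
    received_payload: dict  # payload as this agent received it
    output_payload: dict    # what this agent produced
    received_at_ms: int     # milliseconds after T0
    finished_at_ms: int     # milliseconds after T0


def digest(payload):
    """Stable hash of a JSON-serializable payload (keys sorted for determinism)."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def inspect_hop(hop, is_valid_output, max_processing_ms):
    """Answer the three integrity questions for a single hop."""
    return {
        "transmission": digest(hop.sent_payload) == digest(hop.received_payload),
        "processing": is_valid_output(hop.received_payload, hop.output_payload),
        "latency": hop.finished_at_ms - hop.received_at_ms <= max_processing_ms,
    }


def keeps_order_id(inp, out):
    # Assumed contract: an agent must carry the order_id through unchanged.
    return out.get("order_id") == inp.get("order_id")


order = {"order_id": 7, "qty": 2}
priced = {"order_id": 7, "total": 19.98}

# Agent B: received at T0+200ms, forwarded at T0+4.2s.
hop_b = Hop("Agent B", order, order, priced, 200, 4200)
# Agent C: received at T0+4.4s, output delivered at T0+47.3s.
hop_c = Hop("Agent C", priced, priced, {"order_id": 7, "status": "done"}, 4400, 47300)

BUDGET_MS = 10_000  # assumed expected range; not specified in the module
report_b = inspect_hop(hop_b, keeps_order_id, BUDGET_MS)
report_c = inspect_hop(hop_c, keeps_order_id, BUDGET_MS)
```

Under the assumed budget, Agent B passes all three checks, while Agent C passes transmission and processing but fails latency, because it spent 42.9 seconds processing.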

In practice, most cascade failures become visible within the first three hops of the trace. The error either enters the chain at a specific point (an agent produces bad output from good input) or enters from outside the chain (bad external data is ingested at the boundary). The distinction matters: internal errors are fixable through role contracts and quality loops; external data errors require boundary validation.
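The internal-versus-external distinction can be sketched as a classifier over the ordered hops of a trace. This is an illustration under stated assumptions: the `locate_error` function, the `(agent, input, output)` tuple shape, and the `is_good` predicate are hypothetical, and the sketch assumes external data enters only at the first hop of the chain.

```python
# Remedies as named in the text above.
REMEDY = {
    "internal": "role contracts and quality loops",
    "external": "boundary validation",
}


def locate_error(hops, is_good):
    """Classify where an error entered the chain.

    hops: list of (agent, input_payload, output_payload), in chain order.
    is_good: hypothetical predicate that decides whether a payload is valid.
    Returns (kind, agent), where kind is "external", "internal", or "none".
    """
    if not hops:
        return ("none", None)

    first_agent, first_input, _ = hops[0]
    if not is_good(first_input):
        # Bad data was already present when it crossed the boundary.
        return ("external", first_agent)

    for agent, inp, out in hops:
        if is_good(inp) and not is_good(out):
            # Good input, bad output: this agent introduced the error.
            return ("internal", agent)

    return ("none", None)


def is_good(payload):
    # Stand-in validity check for the example.
    return payload["ok"]


internal_trace = [
    ("agent_a", {"ok": True}, {"ok": True}),
    ("agent_b", {"ok": True}, {"ok": False}),   # error introduced here
    ("agent_c", {"ok": False}, {"ok": False}),  # error merely propagated
]
external_trace = [
    ("agent_a", {"ok": False}, {"ok": False}),
    ("agent_b", {"ok": False}, {"ok": False}),
]

internal_result = locate_error(internal_trace, is_good)
external_result = locate_error(external_trace, is_good)
```

In the first trace the classifier points at `agent_b`, the first agent to turn good input into bad output. In the second it reports an external error at the boundary, and `REMEDY` maps each kind to the corresponding fix.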