AT-201c · Module 2

Trace-Based Debugging


In a single-agent system, debugging is straightforward: read the prompt, read the output, identify the mismatch. In a multi-agent system, the failure might originate three handoffs upstream from where it manifests. The research agent returned slightly inaccurate data. The drafting agent incorporated it without questioning. The review agent approved it because the prose was polished. The final deliverable contains a factual error that traces back to step one of a four-step pipeline. Without tracing, you would start debugging at the review stage — the wrong place entirely.

Trace-based debugging follows the trace ID through every handoff in the workflow. Each message, each dispatch, each result carries the same trace ID. When a deliverable has an issue, pull the trace ID and read the entire communication chain: what the coordinator dispatched, what each agent received, what each agent returned, what the coordinator forwarded. The failure point becomes visible because you can see exactly where the information changed from correct to incorrect.
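As a rough illustration of pulling a chain by trace ID, the sketch below filters a message log and sorts it into dispatch order. The `Message` fields, the `seq` counter, and the `pull_trace` helper are hypothetical names chosen for this example; the course's actual message protocol may define them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    trace_id: str   # shared by every message in one workflow run
    seq: int        # hypothetical dispatch-order counter set by the coordinator
    sender: str
    recipient: str
    kind: str       # "dispatch" or "result"
    body: str


def pull_trace(log: list[Message], trace_id: str) -> list[Message]:
    """Return every message for one trace, sorted into dispatch order."""
    chain = [m for m in log if m.trace_id == trace_id]
    return sorted(chain, key=lambda m: m.seq)


# A mixed log: two workflows interleaved, stored out of order.
log = [
    Message("t-42", 2, "research", "coordinator", "result", "Founded in 1998"),
    Message("t-99", 1, "coordinator", "research", "dispatch", "Unrelated task"),
    Message("t-42", 1, "coordinator", "research", "dispatch", "Find founding year"),
    Message("t-42", 3, "coordinator", "drafting", "dispatch", "Draft using: Founded in 1998"),
]

chain = pull_trace(log, "t-42")
for m in chain:
    print(f"{m.seq} {m.sender} -> {m.recipient} [{m.kind}] {m.body}")
```

Reading the printed chain top to bottom shows what was asked, what came back, and what was forwarded, without consulting timestamps.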

This is why the trace ID field in the message protocol is not optional. Without it, correlating messages across a 20-agent workflow is manual forensics — matching timestamps, guessing which dispatch produced which result, reconstructing the chain by inference. With trace IDs, the chain is explicit. Pull the ID, read the chain, find the failure. Five minutes, not five hours.
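One way to make the field non-optional in practice is to reject untraceable messages at the boundary. This is a minimal sketch that assumes messages are plain dictionaries with a `trace_id` key; `validate_message` is a hypothetical helper, not part of any specific framework.

```python
def validate_message(message: dict) -> dict:
    """Reject any message that lacks a non-empty string trace ID."""
    trace_id = message.get("trace_id")
    if not isinstance(trace_id, str) or not trace_id.strip():
        raise ValueError("message rejected: missing trace_id")
    return message


# A traced message passes through unchanged.
accepted = validate_message({"trace_id": "t-42", "body": "Find founding year"})

# An untraced message is refused before it can enter the workflow.
try:
    validate_message({"body": "Find founding year"})
    rejected = False
except ValueError:
    rejected = True
```

Failing early keeps untraceable messages out of the log, so every chain can be rebuilt later from the ID alone.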

1. Identify the Symptom. What is wrong with the final output? A factual error, a missing section, an incorrect format, an off-brand tone. The symptom tells you what kind of failure to look for in the trace.
2. Pull the Trace. Get the trace ID from the final deliverable and retrieve every message in the chain. Read them in dispatch order: the order the coordinator sent them, not the order they completed.
3. Walk the Chain. At each handoff, compare the coordinator's dispatch (what was asked) to the agent's result (what was returned). The first mismatch is either the failure point or the propagation point of an earlier failure.
4. Classify the Failure. Is this a prompt failure (ambiguous instructions), a contract failure (output format mismatch), an execution failure (agent error), or a propagation failure (bad data from an upstream agent)? The classification determines the fix: rewrite the prompt, tighten the contract, improve the agent, or fix the upstream source.
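The walk and the classification can be sketched as follows. The sketch assumes the symptom is already known (here, a wrong founding year) and that a person supplies the check for each handoff. `Handoff`, `first_mismatch`, `FailureKind`, and `FIX` are hypothetical names for illustration; deciding which category applies remains a human judgment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class FailureKind(Enum):
    PROMPT = "prompt"            # ambiguous instructions
    CONTRACT = "contract"        # output format mismatch
    EXECUTION = "execution"      # agent error
    PROPAGATION = "propagation"  # bad data from an upstream agent


# The fix that each classification points to, as listed in step 4.
FIX = {
    FailureKind.PROMPT: "rewrite the prompt",
    FailureKind.CONTRACT: "tighten the contract",
    FailureKind.EXECUTION: "improve the agent",
    FailureKind.PROPAGATION: "fix the upstream source",
}


@dataclass
class Handoff:
    agent: str
    dispatch: str  # what the coordinator asked for
    result: str    # what the agent returned


def first_mismatch(
    chain: list[Handoff], matches: Callable[[Handoff], bool]
) -> Optional[Handoff]:
    """Walk the chain in dispatch order and return the first failing handoff."""
    for handoff in chain:
        if not matches(handoff):
            return handoff
    return None


chain = [
    Handoff("research", "Find the founding year", "Founded in 1998"),
    Handoff("drafting", "Draft intro using: Founded in 1998", "Acme, founded in 1998, makes tools."),
    Handoff("review", "Review the draft", "Approved"),
]

# Symptom: the deliverable states the wrong year, so flag any result repeating it.
culprit = first_mismatch(chain, lambda h: "1998" not in h.result)

# Classification is assigned by the person debugging, not inferred by the code.
kind = FailureKind.EXECUTION
print(f"first mismatch at: {culprit.agent}; fix: {FIX[kind]}")
```

The drafting result repeats the bad year as well, but the walk stops at the research handoff because it comes first in dispatch order. That earliest mismatch is where the investigation should begin.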