OC-301g · Module 1

Distributed Tracing for Agent Workflows


A multi-agent workflow spans multiple agents, multiple services, and potentially multiple external APIs. When something goes wrong — the output is late, the quality is low, or an error occurs — you need to trace the request through every component it touched. Distributed tracing assigns a unique trace ID at the workflow entry point and propagates it through every agent handoff, service call, and data access.
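Here is a minimal TypeScript sketch of the propagation rule. The `TraceContext` type and the helper names (`startWorkflow`, `childContext`, `toTraceparent`, `fromTraceparent`) are hypothetical. Only the header layout follows the W3C Trace Context `traceparent` format: version, 32-hex trace ID, 16-hex parent span ID, flags. IDs come from `Math.random`, which is adequate for illustration but not for production.

```typescript
// Hypothetical trace context carried across every hop of one workflow run.
interface TraceContext {
  traceId: string;       // minted once at the entry point, never changes
  spanId: string;        // new for every hop
  parentSpanId?: string; // the hop that handed work to this one
}

// Random lowercase hex ID (sketch only: Math.random is not cryptographically strong).
function randomHex(bytes: number): string {
  let out = '';
  for (let i = 0; i < bytes; i++) {
    const byte = Math.floor(Math.random() * 256);
    out += (256 + byte).toString(16).slice(1); // always exactly two hex digits
  }
  return out;
}

// Entry point: the only place a trace ID is created.
function startWorkflow(): TraceContext {
  return { traceId: randomHex(16), spanId: randomHex(8) };
}

// Every agent handoff, service call, or data access: keep the trace ID,
// mint a fresh span ID, and point back at the caller's span.
function childContext(parent: TraceContext): TraceContext {
  return { traceId: parent.traceId, spanId: randomHex(8), parentSpanId: parent.spanId };
}

// Crossing a service boundary: encode as a W3C `traceparent` header
// (version 00, sampled flag 01) so the callee can continue the same trace.
function toTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-01`;
}

// Callee side: recover the caller's context, then derive its own span with childContext.
function fromTraceparent(header: string): TraceContext | undefined {
  const parts = header.split('-');
  if (parts.length !== 4 || parts[1].length !== 32 || parts[2].length !== 16) {
    return undefined; // malformed header: the callee should start a new trace
  }
  return { traceId: parts[1], spanId: parts[2] };
}

// Usage: orchestrator -> research agent -> external API call.
const workflow = startWorkflow();
const researchAgent = childContext(workflow);
const apiHeader = toTraceparent(childContext(researchAgent));
console.log(apiHeader); // carries the same trace ID as `workflow`, with a new span ID
```

Every hop copies `traceId` and records `parentSpanId`. As a result, spans emitted by different agents and services can later be grouped by trace and reassembled into a single tree.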

The trace structure for agent workflows has a unique element: decision spans. Traditional traces have spans for service calls (database query: 45ms, API call: 200ms). Agent traces add spans for decision-making (agent deliberation: 1200ms, council vote: 3400ms). Decision spans capture not just the duration but the decision record — what the agent considered and concluded. This enables performance analysis (where is the bottleneck?) and quality analysis (which decision step produced the error?) from the same trace.

interface AgentTraceSpan {
  traceId: string;         // propagated through workflow
  spanId: string;          // unique to this span
  parentSpanId?: string;   // for nested spans
  agentId: string;
  spanType: 'task' | 'decision' | 'api_call' | 'council';
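  startTime: string;       // timestamp, e.g. ISO 8601
  endTime: string;
  durationMs: number;      // endTime - startTime, in milliseconds
  // Decision-specific fields (e.g. on 'decision' and 'council' spans)
  decision?: {
    trigger: string;                 // what prompted the decision
    alternativesConsidered: number;  // how many options were weighed
    confidence: number;              // e.g. 0.0 to 1.0
    selectedAction: string;          // what the agent chose to do
  };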
  // Error fields
  error?: {
    code: string;
    message: string;
    recoverable: boolean;
  };
}
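The functions below sketch how one trace can answer both the performance question and the quality question. They use hypothetical names and work on a subset of the `AgentTraceSpan` fields, restated so the block compiles on its own. The sketch rests on two assumptions. First, self time treats child spans as sequential. Second, actions are nested under the decision that chose them, so the responsible decision is found by walking up `parentSpanId` links. The sample trace reuses the illustrative durations from the text.

```typescript
// A subset of AgentTraceSpan, restated so this sketch compiles on its own.
interface SpanView {
  spanId: string;
  parentSpanId?: string;
  agentId: string;
  spanType: 'task' | 'decision' | 'api_call' | 'council';
  durationMs: number;
  decision?: { trigger: string; selectedAction: string; confidence: number };
  error?: { code: string; message: string; recoverable: boolean };
}

// Performance: the span with the most self time (own duration minus its children's).
// Assumes children run sequentially; clamps at 0 because parallel children can overlap.
function findBottleneck(spans: SpanView[]): { span: SpanView; selfMs: number } | undefined {
  const childTotals = new Map<string, number>();
  for (const s of spans) {
    if (s.parentSpanId !== undefined) {
      childTotals.set(s.parentSpanId, (childTotals.get(s.parentSpanId) ?? 0) + s.durationMs);
    }
  }
  let best: { span: SpanView; selfMs: number } | undefined;
  for (const s of spans) {
    const selfMs = Math.max(0, s.durationMs - (childTotals.get(s.spanId) ?? 0));
    if (best === undefined || selfMs > best.selfMs) best = { span: s, selfMs };
  }
  return best;
}

// Quality: walk up parentSpanId links from a failed span to the nearest decision or council span.
// Assumes each action is nested under the decision that chose it; the hop limit guards against cycles.
function findResponsibleDecision(spans: SpanView[], failed: SpanView): SpanView | undefined {
  const byId = new Map<string, SpanView>();
  for (const s of spans) byId.set(s.spanId, s);
  let current: SpanView | undefined = failed;
  for (let hops = 0; current !== undefined && hops <= spans.length; hops++) {
    if (current.spanType === 'decision' || current.spanType === 'council') return current;
    current = current.parentSpanId !== undefined ? byId.get(current.parentSpanId) : undefined;
  }
  return undefined;
}

// Illustrative trace using the durations quoted in the text.
const trace: SpanView[] = [
  { spanId: 'root', agentId: 'orchestrator', spanType: 'task', durationMs: 5000 },
  { spanId: 'vote', parentSpanId: 'root', agentId: 'orchestrator', spanType: 'council', durationMs: 3400 },
  { spanId: 'deliberate', parentSpanId: 'root', agentId: 'researcher', spanType: 'decision', durationMs: 1200,
    decision: { trigger: 'user request', selectedAction: 'call pricing API', confidence: 0.62 } },
  { spanId: 'fetch', parentSpanId: 'deliberate', agentId: 'researcher', spanType: 'api_call', durationMs: 200,
    error: { code: 'TIMEOUT', message: 'pricing API timed out', recoverable: true } },
];

console.log(findBottleneck(trace)?.span.spanId); // vote (3400 ms of self time)
const failed = trace.filter((s) => s.error !== undefined)[0];
console.log(findResponsibleDecision(trace, failed)?.decision?.selectedAction); // call pricing API
```

The sketch ranks spans by self time rather than raw duration. A root span always has the longest duration, so ranking by raw duration would report the root as the bottleneck every time.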