If you’ve been tinkering with advanced multi-agent orchestration in the OpenAI Agents SDK, you’ve probably come across its built-in “Tracing” system—an infrastructure for capturing every micro-step of your agents’ reasoning. Tracing is critical in modern AI setups, especially when you have multiple agents handing work to one another, tools being invoked mid-conversation, and guardrails that can cut a run short.
But how exactly does the tracing system weave itself into your agent runs, and how can you exploit it for maximum debugging power?
Below, we’ll rip the hood open on the Agents SDK’s tracing modules: how the code is organized, how traces and spans get created and exported, and how you can customize or disable the whole pipeline.
We’ll also highlight relevant resources for each domain: from distributed tracing references to LLM instrumentation papers to more specialized reading on agent orchestration.
Modern LLM-based systems often follow a pattern of multi-step reasoning—what some papers call Chain-of-Thought (CoT) or more advanced frameworks like ReAct. Now add multiple agents (specialists for math, code, triage, etc.) handing the conversation around, plus a variety of tools (web search, function calls), plus guardrails that can intercept or terminate the flow if something suspicious arises—and you have a fairly dynamic environment.
Tracing—the ability to observe each decision, tool call, handoff, or guardrail tripwire—becomes your only reliable way to see “inside” the emergent reasoning of these LLM-based processes. Without robust instrumentation, debugging a multi-agent workflow is like trying to read a novel with only the final page in front of you—impossible to tell how the system arrived at its conclusion (or failed to).
In other words, tracing is your observability tool. Observability is a concept from DevOps and micro-services (check out OpenTelemetry’s approach), but it also maps onto AI agent flows beautifully. The Agents SDK piggybacks on that pattern. Its tracing subsystem:

- starts a trace around each run (Runner.run(...)) if a trace isn’t already active;
- creates typed span data for every notable step (AgentSpanData for an agent call, FunctionSpanData for a tool, etc.).

This design is reminiscent of typical distributed tracing (Jaeger/Zipkin) in micro-services: each service is a “span,” correlated via a “trace ID.” Here, each agent or tool call is the “span,” correlated by a top-level “trace ID” that’s unique to the entire user query or workflow. This is done not for a cluster of micro-services but for the micro-steps of LLM-based reasoning.
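To make the analogy concrete, here is a tiny, purely illustrative sketch (toy dataclasses, not the SDK’s classes) of how every span carries the workflow’s trace ID plus an optional parent span ID:

# Illustrative toy model of the trace/span correlation described above;
# these are not the SDK's real classes.
import uuid
from dataclasses import dataclass, field

@dataclass
class ToySpan:
    trace_id: str                  # every span carries the workflow's trace ID
    span_id: str = field(default_factory=lambda: f"span_{uuid.uuid4().hex[:8]}")
    parent_id: str | None = None   # None for top-level spans, else the enclosing span's ID

@dataclass
class ToyTrace:
    workflow_name: str
    trace_id: str = field(default_factory=lambda: f"trace_{uuid.uuid4().hex[:8]}")

toy_trace = ToyTrace("CustomerSupport")
agent_step = ToySpan(trace_id=toy_trace.trace_id)                                 # the agent call
tool_step = ToySpan(trace_id=toy_trace.trace_id, parent_id=agent_step.span_id)   # nested tool call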
The main code is in src/agents/tracing. Let’s break down the key files:

- __init__.py: registers the default processor via add_trace_processor(default_processor()) and exposes trace(), agent_span(), guardrail_span(), and so on.
- create.py: trace(...) starts or gets the current trace; agent_span(...), function_span(...), generation_span(...), etc. each call GLOBAL_TRACE_PROVIDER.create_span(...).
- setup.py: holds GLOBAL_TRACE_PROVIDER (a singleton TraceProvider). TraceProvider has methods create_trace(...) and create_span(...).
- spans.py / traces.py: SpanImpl and NoOpSpan for spans, TraceImpl and NoOpTrace for the top-level trace. Their .start() / .finish() or .export() methods hook into the process that logs the data.
- processor_interface.py / processors.py: TracingProcessor defines abstract methods on_trace_start, on_trace_end, on_span_start, on_span_end. BatchTraceProcessor is the real default: it queues up data and sends it in the background. BackendSpanExporter hits the OpenAI ingestion endpoint. If you want local JSON logs, you can override that.
- span_data.py: AgentSpanData(name, tools, handoffs, ...), FunctionSpanData(name, input, output), etc. Each has an .export() method that returns the dict you eventually see in final logs.
- util.py: helpers such as gen_trace_id() and time_iso() for timestamps.

Often, if you do:
from agents import Runner
from agents.tracing import trace

with trace(workflow_name="CustomerSupport") as t:
    result = Runner.run_sync(agent, "My question here.")  # or: await Runner.run(agent, ...) in async code
trace() checks if a trace is active. If not, it calls GLOBAL_TRACE_PROVIDER.create_trace(...), giving you a TraceImpl. This sets up _trace_id = "trace_<uuid>", _workflow_name, _group_id, etc. Once started, it’s put in a contextvar so all subsequent code sees the same trace as “current.”
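Here’s a stripped-down sketch of that contextvar pattern, purely to illustrate the idea; the SDK’s real TraceProvider and scope handling are more involved:

# Simplified sketch of the "current trace" contextvar idea described above;
# not the SDK's actual implementation.
import uuid
from contextvars import ContextVar

_current_trace: ContextVar[dict | None] = ContextVar("current_trace", default=None)

def toy_trace(workflow_name: str) -> dict:
    existing = _current_trace.get()
    if existing is not None:
        return existing              # reuse the already-active trace
    new_trace = {"trace_id": f"trace_{uuid.uuid4().hex}", "workflow_name": workflow_name}
    _current_trace.set(new_trace)    # downstream code now sees this trace as "current"
    return new_trace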
Inside that run, the SDK code (or your code) might do:
with agent_span(name="TriageAgent") as span:
    ...  # Triage agent LLM call...
Under the hood, agent_span() calls TraceProvider.create_span(span_data=AgentSpanData(...)). If _disabled is false, we return a SpanImpl; if disabled, a NoOpSpan. The real SpanImpl sets parent_id to the current span (if any) or sets it to None if it’s top-level in the trace.
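As a mental model, that decision boils down to something like the toy sketch below (simplified, not the SDK’s actual code):

# Toy sketch of the span-creation decision described above.
from contextvars import ContextVar

_current_span_id: ContextVar[str | None] = ContextVar("current_span_id", default=None)

class ToyNoOpSpan:
    """Returned when tracing is disabled; records nothing."""

class ToyRealSpan:
    def __init__(self, parent_id: str | None):
        # parent_id is the currently active span, or None for a top-level span
        self.parent_id = parent_id

def toy_create_span(disabled: bool):
    if disabled:
        return ToyNoOpSpan()
    return ToyRealSpan(parent_id=_current_span_id.get())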
As soon as we enter the context, we do span.start(mark_as_current=True), setting that span as the current span in a context variable. If the LLM fails or we record an error, we might do span.set_error(...). Then, on exit, we do span.finish(), calling processor.on_span_end(self), which queue-puts it for export.
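If you ever need to drive that lifecycle by hand instead of through the context manager, a rough equivalent looks like this (a sketch; the with-block above does all of this for you):

# Sketch: the same start/finish lifecycle driven manually instead of via the
# context manager. In practice, prefer the with-block form shown earlier.
from agents.tracing import agent_span

span = agent_span(name="TriageAgent")
span.start(mark_as_current=True)   # span becomes "current" in the context variable
try:
    ...                            # the agent's LLM call would happen here
finally:
    span.finish()                  # triggers on_span_end, queueing the span for export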
If a triage agent does a handoff, the code might create a handoff_span(...); if it calls a tool, it might create a function_span(...). Each is just a specialized typed approach to the same flow: we gather metadata in HandoffSpanData or FunctionSpanData.
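For instance, you could emit those spans yourself around custom logic. The keyword arguments and the span.span_data attribute below are inferred from the span-data fields listed earlier, so treat them as a sketch rather than a guaranteed signature:

# Hedged sketch: emitting specialized spans by hand. Parameter names follow
# the FunctionSpanData / HandoffSpanData fields above and may differ in your SDK version.
from agents.tracing import function_span, handoff_span

with function_span(name="search_order_db", input='{"order_id": 42}') as span:
    result = '{"status": "shipped"}'     # ...call the real tool here...
    span.span_data.output = result       # recorded into FunctionSpanData

with handoff_span(from_agent="TriageAgent", to_agent="BillingAgent"):
    ...                                  # the receiving agent takes over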
Finally, when we exit the entire with trace(...): block, we do trace.finish(), calling on_trace_end(...) and queueing that final trace object. Then the background thread in BatchTraceProcessor lumps all queued spans + trace objects together and calls BackendSpanExporter.export(...). That does an HTTP POST to https://api.openai.com/v1/traces/ingest with JSON like:
{
  "data": [
    {
      "object": "trace",
      "id": "trace_abc123",
      "workflow_name": "CustomerSupport",
      "group_id": null,
      "metadata": {}
    },
    {
      "object": "trace.span",
      "id": "span_xyz987",
      "trace_id": "trace_abc123",
      "started_at": "...",
      "ended_at": "...",
      "span_data": { ... }
    },
    ...
  ]
}
If the request is successful, the data eventually surfaces in the OpenAI Traces Dashboard. If you want to see more detail in your logs, you can also attach your own TracingProcessor.
From the perspective of your code or the built-in Runner:

- Runner.run() will automatically call trace(), ensuring each run is traced. If a trace is already active, it just uses that.
- Each agent invocation is wrapped in an agent_span(...). The system might generate sub-spans for each tool call (function) with function_span(...).
- Each guardrail check gets a guardrail_span(...). This records the name and whether it triggered or not.

Hence, your final trace might look like:
<Trace: _trace_id="trace_1234", name="CustomerSupport">
  <Span #1: agent "TriageAgent", data about tools available>
    <Span #2: tool call "search_order_db">
  <Span #3: agent "BillingAgent">
  <Span #4: guardrail "refund_request_check" triggered an error>
Everything lines up so you have a chronological record from start to finish.
You can globally kill tracing with:
from agents.tracing import set_tracing_disabled
set_tracing_disabled(True)
This yields NoOpTrace and NoOpSpan. Nothing is recorded, and there’s no overhead beyond minimal condition checks.
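If you only want to skip tracing for a single run rather than globally, the run configuration exposes a per-run switch; the tracing_disabled field name below is an assumption worth verifying against your installed SDK version:

# Hedged sketch: disable tracing for one run only, leaving the global default alone.
from agents import Agent, Runner, RunConfig

agent = Agent(name="Assistant", instructions="Be helpful.")
result = Runner.run_sync(
    agent,
    "My question here.",
    run_config=RunConfig(tracing_disabled=True),  # assumed field name
)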
You can set run_config.trace_include_sensitive_data=False, so input prompts and outputs aren’t stored in the span data. This ensures user data is not posted to the tracing endpoint if you have privacy concerns.
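In practice that looks roughly like this (the agent, prompt, and Runner.run_sync call are just placeholders; the RunConfig field is the one named above):

# Sketch: keep tracing enabled, but omit prompt/response payloads from span data.
from agents import Agent, Runner, RunConfig

agent = Agent(name="SupportAgent", instructions="Help with billing questions.")
result = Runner.run_sync(
    agent,
    "My card was charged twice.",
    run_config=RunConfig(trace_include_sensitive_data=False),
)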
To route trace data through your own processor, use:
from agents.tracing import TracingProcessor, set_trace_processors

class PrintEverythingProcessor(TracingProcessor):
    def on_trace_start(self, trace):
        print("Trace started:", trace.export())

    def on_trace_end(self, trace):
        print("Trace ended:", trace.export())

    def on_span_start(self, span):
        pass

    def on_span_end(self, span):
        print("Span ended:", span.export())

    def shutdown(self):
        pass

    def force_flush(self):
        pass

set_trace_processors([PrintEverythingProcessor()])
Now everything is dumped to stdout. That’s it—no more calls to the default backend. Perfect if you want to stash data in your own DB or skip cloud export.
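And if you want your processor to run alongside the default OpenAI exporter instead of replacing it, register it additively with add_trace_processor (the same helper __init__.py uses for the default processor):

# Keeps the default backend export and adds our processor on top of it.
from agents.tracing import add_trace_processor

add_trace_processor(PrintEverythingProcessor())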
See the src/agents/tracing folder for reference. Tracing in the OpenAI Agents SDK is essentially your Swiss Army knife for diagnosing multi-agent, tool-heavy flows. You get a chronological, structured record of every agent turn, tool call, handoff, and guardrail decision in a run.
In a world where LLMs can do sophisticated multi-step actions—some correct, some not—tracing is how you pin down exactly why an agent chose to call that tool or why it decided to hand off to another agent. For advanced or production-level usage, you can integrate these logs with your own analytics, build “agent dashboards,” or run offline evaluations.
So next time you see a cryptic multi-agent glitch, pop open the trace logs. You’ll see the chain-of-thought spelled out in terms of spans, sub-calls, guardrail outcomes, etc. That clarity is exactly what you need to make sense of these emergent behaviors. Happy tracing!