If you’ve been tinkering with advanced multi-agent orchestration in the OpenAI Agents SDK, you’ve probably come across its built-in “Tracing” system—an infrastructure for capturing every micro-step of your agents’ reasoning. Tracing is critical in modern AI setups, especially when you have multiple agents handing work to one another, tools being invoked mid-conversation, and guardrails that can cut a run short.
But how exactly does the tracing system weave itself into your agent runs, and how can you exploit it for maximum debugging power?
Below, we’ll rip the hood open on the Agents SDK’s tracing modules: how the code is organized, how traces and spans get created and exported, and how you can customize or disable the whole pipeline.
We’ll also highlight relevant resources for each domain: from distributed tracing references to LLM instrumentation papers to more specialized reading on agent orchestration.
Modern LLM-based systems often follow a pattern of multi-step reasoning—what some papers call Chain-of-Thought (CoT) or more advanced frameworks like ReAct. Now add multiple agents (specialists for math, code, triage, etc.) handing the conversation around, plus a variety of tools (web search, function calls), plus guardrails that can intercept or terminate the flow if something suspicious arises—and you have a fairly dynamic environment.
Tracing—the ability to observe each decision, tool call, handoff, or guardrail tripwire—becomes your only reliable way to see “inside” the emergent reasoning of these LLM-based processes. Without robust instrumentation, debugging a multi-agent workflow is like trying to read a novel with only the final page in front of you—impossible to tell how the system arrived at its conclusion (or failed to).
In other words, tracing is your observability tool. Observability is a concept from DevOps and micro-services (check out OpenTelemetry’s approach), but it also maps onto AI agent flows beautifully. The Agents SDK piggybacks on that pattern. Its tracing subsystem:

- starts a trace around each run (Runner.run(...)) if a trace isn’t already active;
- creates typed span data for every notable step (AgentSpanData for an agent call, FunctionSpanData for a tool, etc.).

This design is reminiscent of typical distributed tracing (Jaeger/Zipkin) in micro-services: each service is a “span,” correlated via a “trace ID.” Here, each agent or tool call is the “span,” correlated by a top-level “trace ID” that’s unique to the entire user query or workflow. This is done not for a cluster of micro-services but for the micro-steps of LLM-based reasoning.
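To make the analogy concrete, here is a tiny, purely illustrative sketch (toy dataclasses, not the SDK’s classes) of how every span carries the workflow’s trace ID plus an optional parent span ID:

# Illustrative toy model of the trace/span correlation described above;
# these are not the SDK's real classes.
import uuid
from dataclasses import dataclass, field

@dataclass
class ToySpan:
    trace_id: str                  # every span carries the workflow's trace ID
    span_id: str = field(default_factory=lambda: f"span_{uuid.uuid4().hex[:8]}")
    parent_id: str | None = None   # None for top-level spans, else the enclosing span's ID

@dataclass
class ToyTrace:
    workflow_name: str
    trace_id: str = field(default_factory=lambda: f"trace_{uuid.uuid4().hex[:8]}")

toy_trace = ToyTrace("CustomerSupport")
agent_step = ToySpan(trace_id=toy_trace.trace_id)                                 # the agent call
tool_step = ToySpan(trace_id=toy_trace.trace_id, parent_id=agent_step.span_id)   # nested tool call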
The main code is in src/agents/tracing. Let’s break down the key files:

- __init__.py: registers the default processor via add_trace_processor(default_processor()) and exposes trace(), agent_span(), guardrail_span(), and so on.
- create.py: trace(...) starts or gets the current trace; agent_span(...), function_span(...), generation_span(...), etc. each call GLOBAL_TRACE_PROVIDER.create_span(...).
- setup.py: holds GLOBAL_TRACE_PROVIDER (a singleton TraceProvider). TraceProvider has methods create_trace(...) and create_span(...).
- spans.py / traces.py: SpanImpl and NoOpSpan for spans, TraceImpl and NoOpTrace for the top-level trace. Their .start() / .finish() or .export() methods hook into the process that logs the data.
- processor_interface.py / processors.py: TracingProcessor defines abstract methods on_trace_start, on_trace_end, on_span_start, on_span_end. BatchTraceProcessor is the real default: it queues up data and sends it in the background. BackendSpanExporter hits the OpenAI ingestion endpoint. If you want local JSON logs, you can override that.
- span_data.py: AgentSpanData(name, tools, handoffs, ...), FunctionSpanData(name, input, output), etc. Each has an .export() method that returns the dict you eventually see in final logs.
- util.py: helpers such as gen_trace_id() and time_iso() for timestamps.

Often, if you do:
from agents import Runner
from agents.tracing import trace

with trace(workflow_name="CustomerSupport") as t:
    result = Runner.run_sync(agent, "My question here.")  # or: await Runner.run(agent, ...) in async code
trace() checks if a trace is active. If not, it calls GLOBAL_TRACE_PROVIDER.create_trace(...), giving you a TraceImpl. This sets up _trace_id = "trace_<uuid>", _workflow_name, _group_id, etc. Once started, it’s put in a contextvar so all subsequent code sees the same trace as “current.”
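Here’s a stripped-down sketch of that contextvar pattern, purely to illustrate the idea; the SDK’s real TraceProvider and scope handling are more involved:

# Simplified sketch of the "current trace" contextvar idea described above;
# not the SDK's actual implementation.
import uuid
from contextvars import ContextVar

_current_trace: ContextVar[dict | None] = ContextVar("current_trace", default=None)

def toy_trace(workflow_name: str) -> dict:
    existing = _current_trace.get()
    if existing is not None:
        return existing              # reuse the already-active trace
    new_trace = {"trace_id": f"trace_{uuid.uuid4().hex}", "workflow_name": workflow_name}
    _current_trace.set(new_trace)    # downstream code now sees this trace as "current"
    return new_trace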
Inside that run, the SDK code (or your code) might do:
with agent_span(name="TriageAgent") as span:
    ...  # Triage agent LLM call...
Under the hood, agent_span() calls TraceProvider.create_span(span_data=AgentSpanData(...)). If _disabled is false, we return a SpanImpl; if disabled, a NoOpSpan. The real SpanImpl sets parent_id to the current span (if any) or sets it to None if it’s top-level in the trace.
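As a mental model, that decision boils down to something like the toy sketch below (simplified, not the SDK’s actual code):

# Toy sketch of the span-creation decision described above.
from contextvars import ContextVar

_current_span_id: ContextVar[str | None] = ContextVar("current_span_id", default=None)

class ToyNoOpSpan:
    """Returned when tracing is disabled; records nothing."""

class ToyRealSpan:
    def __init__(self, parent_id: str | None):
        # parent_id is the currently active span, or None for a top-level span
        self.parent_id = parent_id

def toy_create_span(disabled: bool):
    if disabled:
        return ToyNoOpSpan()
    return ToyRealSpan(parent_id=_current_span_id.get())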
As soon as we enter the context, we do span.start(mark_as_current=True), setting that span as the current span in a context variable. If the LLM fails or we record an error, we might do span.set_error(...). Then, on exit, we do span.finish(), calling processor.on_span_end(self), which queue-puts it for export.
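If you ever need to drive that lifecycle by hand instead of through the context manager, a rough equivalent looks like this (a sketch; the with-block above does all of this for you):

# Sketch: the same start/finish lifecycle driven manually instead of via the
# context manager. In practice, prefer the with-block form shown earlier.
from agents.tracing import agent_span

span = agent_span(name="TriageAgent")
span.start(mark_as_current=True)   # span becomes "current" in the context variable
try:
    ...                            # the agent's LLM call would happen here
finally:
    span.finish()                  # triggers on_span_end, queueing the span for export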
If a triage agent does a handoff, the code might create a handoff_span(...); if it calls a tool, it might create a function_span(...). Each is just a specialized typed approach to the same flow: we gather metadata in HandoffSpanData or FunctionSpanData.
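For instance, you could emit those spans yourself around custom logic. The keyword arguments and the span.span_data attribute below are inferred from the span-data fields listed earlier, so treat them as a sketch rather than a guaranteed signature:

# Hedged sketch: emitting specialized spans by hand. Parameter names follow
# the FunctionSpanData / HandoffSpanData fields above and may differ in your SDK version.
from agents.tracing import function_span, handoff_span

with function_span(name="search_order_db", input='{"order_id": 42}') as span:
    result = '{"status": "shipped"}'     # ...call the real tool here...
    span.span_data.output = result       # recorded into FunctionSpanData

with handoff_span(from_agent="TriageAgent", to_agent="BillingAgent"):
    ...                                  # the receiving agent takes over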
Finally, when we exit the entire with trace(...): block, we do trace.finish(), calling on_trace_end(...) and queueing that final trace object. Then the background thread in BatchTraceProcessor lumps all queued spans + trace objects together and calls BackendSpanExporter.export(...). That does an HTTP POST to https://api.openai.com/v1/traces/ingest with JSON like:
{
  "data": [
    {
      "object": "trace",
      "id": "trace_abc123",
      "workflow_name": "CustomerSupport",
      "group_id": null,
      "metadata": {}
    },
    {
      "object": "trace.span",
      "id": "span_xyz987",
      "trace_id": "trace_abc123",
      "started_at": "...",
      "ended_at": "...",
      "span_data": { ... }
    },
    ...
  ]
}
If the request is successful, the data eventually surfaces in the OpenAI Traces Dashboard. If you want to see more detail in your logs, you can also attach your own TracingProcessor.
From the perspective of your code or the built-in Runner:

- Runner.run() will automatically call trace(), ensuring each run is traced. If a trace is already active, it just uses that.
- Each agent invocation is wrapped in an agent_span(...). The system might generate sub-spans for each tool call (function) with function_span(...).
- Each guardrail check gets a guardrail_span(...). This records the name and whether it triggered or not.

Hence, your final trace might look like:
<Trace: _trace_id="trace_1234", name="CustomerSupport">
  <Span #1: agent "TriageAgent", data about tools available>
    <Span #2: tool call "search_order_db">
  <Span #3: agent "BillingAgent">
  <Span #4: guardrail "refund_request_check" triggered an error>
Everything lines up so you have a chronological record from start to finish.
You can globally kill tracing with:
from agents.tracing import set_tracing_disabled
set_tracing_disabled(True)
This yields NoOpTrace and NoOpSpan. Nothing is recorded, and there’s no overhead beyond minimal condition checks.
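If you only want to skip tracing for a single run rather than globally, the run configuration exposes a per-run switch; the tracing_disabled field name below is an assumption worth verifying against your installed SDK version:

# Hedged sketch: disable tracing for one run only, leaving the global default alone.
from agents import Agent, Runner, RunConfig

agent = Agent(name="Assistant", instructions="Be helpful.")
result = Runner.run_sync(
    agent,
    "My question here.",
    run_config=RunConfig(tracing_disabled=True),  # assumed field name
)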
You can set run_config.trace_include_sensitive_data=False, so input prompts and outputs aren’t stored in the span data. This ensures user data is not posted to the tracing endpoint if you have privacy concerns.
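In practice that looks roughly like this (the agent, prompt, and Runner.run_sync call are just placeholders; the RunConfig field is the one named above):

# Sketch: keep tracing enabled, but omit prompt/response payloads from span data.
from agents import Agent, Runner, RunConfig

agent = Agent(name="SupportAgent", instructions="Help with billing questions.")
result = Runner.run_sync(
    agent,
    "My card was charged twice.",
    run_config=RunConfig(trace_include_sensitive_data=False),
)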
To route trace data through your own processor, use:
from agents.tracing import TracingProcessor, set_trace_processors

class PrintEverythingProcessor(TracingProcessor):
    def on_trace_start(self, trace):
        print("Trace started:", trace.export())

    def on_trace_end(self, trace):
        print("Trace ended:", trace.export())

    def on_span_start(self, span):
        pass

    def on_span_end(self, span):
        print("Span ended:", span.export())

    def shutdown(self):
        pass

    def force_flush(self):
        pass

set_trace_processors([PrintEverythingProcessor()])
Now everything is dumped to stdout. That’s it—no more calls to the default backend. Perfect if you want to stash data in your own DB or skip cloud export.
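And if you want your processor to run alongside the default OpenAI exporter instead of replacing it, register it additively with add_trace_processor (the same helper __init__.py uses for the default processor):

# Keeps the default backend export and adds our processor on top of it.
from agents.tracing import add_trace_processor

add_trace_processor(PrintEverythingProcessor())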
See the src/agents/tracing folder for reference. Tracing in the OpenAI Agents SDK is essentially your Swiss Army knife for diagnosing multi-agent, tool-heavy flows. You get a chronological, structured record of every agent turn, tool call, handoff, and guardrail decision in a run.
In a world where LLMs can do sophisticated multi-step actions—some correct, some not—tracing is how you pin down exactly why an agent chose to call that tool or why it decided to hand off to another agent. For advanced or production-level usage, you can integrate these logs with your own analytics, build “agent dashboards,” or run offline evaluations.
So next time you see a cryptic multi-agent glitch, pop open the trace logs. You’ll see the chain-of-thought spelled out in terms of spans, sub-calls, guardrail outcomes, etc. That clarity is exactly what you need to make sense of these emergent behaviors. Happy tracing!