Agent Observability: How to Trace and Debug AI Agents with OpenTelemetry
Every agent team hits the same wall eventually. The dashboard is green. Every LLM call returned 200. Every tool executed without error. Latency was fine. And the user is staring at output that is confidently, completely wrong.
Without observability, you guess what went wrong. With observability, you open a trace and read what actually happened. Those are not the same job — and in 2026, the tooling finally exists to do this properly.
The one-sentence version: Status codes tell you whether requests succeeded; traces tell you whether your agent did the right thing — and in agent systems, those two things are not the same.
The Core Problem: Agents Fail in Ways That Look Like Success
Traditional monitoring is built around a simple contract: if the request returned 200 and latency was within budget, everything is fine. That contract breaks completely in agentic systems.
A model can return a well-formed, grammatically perfect answer that is factually wrong. A tool can execute successfully with the wrong arguments. A handoff between agents can pass corrupted or stale state, and both sides will happily report success. The pipeline stays green while the output is garbage.
This is the fundamental observability gap in agent systems:
- Success = the infrastructure did not error
- Correctness = the agent did the right thing
Logs and status codes measure success. You need traces to measure correctness.
What to Actually Trace: Four Layers, One Standard
In 2026, the field converged on a vendor-neutral standard: OpenTelemetry GenAI Semantic Conventions, developed by the OTel GenAI Special Interest Group starting in April 2024. The client-side span conventions are stabilizing, and major observability platforms — Datadog, Honeycomb, New Relic, Grafana — support them natively. Major frameworks — LangChain, CrewAI, the OpenAI Agents SDK — emit OTel-compliant spans out of the box.
The payoff: the same span names, the same attribute vocabulary, every backend, every framework. One queryable tree per agent run.
A well-instrumented agent trace looks like this:
invoke_agent (root span, ~25s)
├── llm.chat (LLM call span)
├── tool.execute: search_kb (tool span, includes args + return value)
├── tool.execute: write_draft (tool span)
└── agent.handoff: sub_agent_A (handoff span)
    ├── llm.chat (sub-agent LLM call)
    └── tool.execute: send_email (sub-agent tool span)
The four required layers are:
| Layer | What it captures |
|---|---|
| LLM client | Model called, tokens used, latency, response |
| Agent invocation | Entry point, inputs, overall outcome |
| Tool execution | Tool name, arguments passed in, return value |
| Workflow / handoffs | State passed between agents, sub-agent trees |
Four layers. One vocabulary. If you only instrument LLM calls, you will not be able to debug agent failures — because the bug almost never lives in the LLM call.
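To make the tree shape concrete, here is a minimal, dependency-free sketch of how nested spans across the four layers end up as one tree per run. The `Span` and `Tracer` classes are hypothetical stand-ins written for this example, not the OpenTelemetry API, and the agent, model, and tool names are illustrative. A real setup would get the same nesting from the OpenTelemetry SDK or a framework's built-in instrumentation.

```python
from __future__ import annotations

import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span (not the OpenTelemetry API)."""
    name: str
    attributes: dict = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)
    duration_s: float = 0.0


class Tracer:
    """Keeps a stack of open spans so nested `with` blocks build one tree."""

    def __init__(self) -> None:
        self.root: Span | None = None
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, **attributes):
        node = Span(name, dict(attributes))
        if self._stack:
            self._stack[-1].children.append(node)  # child of the open span
        else:
            self.root = node  # first span opened becomes the root
        self._stack.append(node)
        start = time.perf_counter()
        try:
            yield node
        finally:
            node.duration_s = time.perf_counter() - start
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list[str]:
    """Flatten the tree into indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("invoke_agent", agent="support_agent"):            # agent invocation layer
    with tracer.span("llm.chat", model="example-model"):            # LLM client layer
        pass
    with tracer.span("tool.execute: search_kb", args={"query": "refund policy"}):  # tool layer
        pass
    with tracer.span("agent.handoff: sub_agent_A"):                 # workflow / handoff layer
        with tracer.span("llm.chat", model="example-model"):
            pass
        with tracer.span("tool.execute: send_email", args={"to": "user@example.com"}):
            pass

tree = render(tracer.root)
print("\n".join(tree))
```

The point of the sketch is the parent/child relationship: the sub-agent's LLM call and tool call hang under the handoff span, so a single query over the root span reaches everything the run did.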
The Rule That Separates Teams Who Debug from Teams Who Guess
Here is the insight that changes how you think about instrumentation:
The most informative span is almost never the LLM call.
LLM calls almost always succeed. The model returns something. The bug lives elsewhere:
- In handoffs: one agent passed bad or corrupted state to another
- In tool I/O: the tool received wrong arguments, or returned something unexpected that the agent misinterpreted
- In state mutations: something changed in the shared context just before the wrong answer was produced
The mental model to hold: instrument the handoffs and tool I/O with as much detail as you instrument the LLM calls. More detail, actually — because that is where the truth lives.
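As an illustration of tool I/O instrumentation, the sketch below wraps a tool in a decorator that records its arguments and return value. Everything here is an assumption made for the example: `traced_tool`, the in-memory `SPANS` list standing in for an exporter, and the toy `search_kb` tool. It shows a call that "succeeds" while carrying a mangled argument, which only the recorded span makes visible.

```python
import functools

SPANS = []  # hypothetical in-memory sink standing in for a span exporter


def traced_tool(func):
    """Record the arguments, return value, and status of every call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = {
            "name": f"tool.execute: {func.__name__}",
            "args": {"positional": list(args), "keyword": dict(kwargs)},
            "status": "ok",
        }
        try:
            result = func(*args, **kwargs)
            span["result"] = result
            return result
        except Exception as exc:
            span["status"] = "error"
            span["error"] = repr(exc)
            raise
        finally:
            SPANS.append(span)  # the span is recorded on success and on failure
    return wrapper


@traced_tool
def search_kb(query, max_results=3):
    # Toy knowledge base: the tool returns without error for any query.
    kb = {"refund policy": ["Refunds are issued within 14 days."]}
    return kb.get(query, [])[:max_results]


# The agent meant to ask about refunds but passed a mangled query.
search_kb("refnd polcy")
last = SPANS[-1]
print(last["status"], last["args"]["positional"], last["result"])
```

The status is `ok` and nothing raised, yet the span shows the misspelled query and the empty result side by side, which is the evidence a status code alone would never surface.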
Sampling Strategy in Production
You cannot (and should not) keep 100% of every trace in production — storage costs and cardinality issues aside, you will drown in noise. The recommended philosophy:
- Log 100% of errors — never sample away a failure
- Sample successes — a representative subset is enough for performance analysis
- Always capture handoffs, tool I/O, and state mutations in full, even in successful runs where the rest of the trace is sampled. These are the high-signal spans, and sampling them defeats the purpose
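One way to read these three rules as code is sketched below. The function names, the 10% default rate, and the span-name prefixes are assumptions chosen for the example, not part of any standard. In production this decision usually lives in a collector-side sampling stage rather than in application code.

```python
import hashlib

# Assumed naming scheme for the high-signal spans; adjust to your own span names.
HIGH_SIGNAL_PREFIXES = ("agent.handoff", "tool.execute", "state.mutation")


def keep_trace(trace_id: str, has_error: bool, success_rate: float = 0.1) -> bool:
    """Keep every failed run; keep a deterministic fraction of successful runs."""
    if has_error:
        return True
    # Hash the trace ID so the same run always gets the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1]
    return bucket < success_rate


def keep_span_detail(span_name: str, trace_sampled: bool) -> bool:
    """High-signal spans keep full detail even when the run was sampled out."""
    return trace_sampled or span_name.startswith(HIGH_SIGNAL_PREFIXES)


decision = keep_trace("run-0001", has_error=False)
print(decision, keep_span_detail("tool.execute: search_kb", trace_sampled=decision))
```

Hashing the trace ID instead of calling a random number generator keeps the decision consistent if several services evaluate the same run independently.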
Handling Sensitive Data: Three Layers of Protection
By default, OpenTelemetry GenAI conventions allow capturing message content — prompts and completions — in trace events. In production, you almost certainly do not want raw user data or model outputs sitting in your observability backend.
Three layers of protection, applied in order:
- Set `capture_message_content = false` at the SDK level. This is the first gate: it stops sensitive data from ever entering the span
- Add a redaction processor at the OTel collector to catch anything that slips through the SDK setting
- Lock down access to the trace backend with role-based access and audit logging; assume the first two layers will each fail at least once
Each layer will fail at some point. Use all three.
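The first two layers can be sketched as a single attribute filter. This is only an illustration of the logic: the `prompt` and `completion` keys, the email pattern, and the `redact_attributes` function are assumptions for the example, and in a real deployment the second layer would run as a processor in the collector rather than in application code. The third layer, access control, is backend configuration and is not shown.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CONTENT_KEYS = {"prompt", "completion"}  # hypothetical attribute keys holding message content


def redact_attributes(attributes: dict, capture_message_content: bool = False) -> dict:
    """Return a copy of span attributes that is safer to export."""
    cleaned = {}
    for key, value in attributes.items():
        if key in CONTENT_KEYS and not capture_message_content:
            continue  # layer 1: message content never enters the span
        if isinstance(value, str):
            value = EMAIL.sub("[REDACTED_EMAIL]", value)  # layer 2: pattern redaction
        cleaned[key] = value
    return cleaned


span_attributes = {
    "prompt": "My email is a@b.com, please resend the invoice",
    "tool.args": "send to jane.doe@example.com",
    "tokens": 42,
}
exported = redact_attributes(span_attributes)
print(exported)
```

Note that the tool arguments still leak an address after layer one, which is exactly why the second layer exists: message content is not the only place user data shows up.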
ASCII Overview: Where the Bug Usually Lives
Agent Run
│
├─ LLM Call ─────────────── Almost always 200 OK. Rarely where the bug is.
│
├─ Tool Execution ───────── Arguments matter. Return values matter.
│   └─ wrong args ───────── Bug: "tool succeeded" but with garbage input
│
└─ Handoff to Sub-Agent ─── State passed here. This is where truth lives.
    └─ corrupted state ──── Bug: both sides report success; output is wrong
Common Misconceptions
- “A green dashboard means my agent is working correctly.” No. Green metrics mean requests completed without infrastructure errors. Correctness, meaning whether the agent did the right thing, is not captured by status codes or latency.
- “Instrumenting LLM calls is enough for agent observability.” It is a start, but the LLM call is the least informative span when debugging agent failures. Tool I/O and handoff spans are where the actual bugs surface.
- “I need to build a custom observability layer for agents.” Not in 2026. The OTel GenAI semantic conventions are the standard, and the major frameworks emit compliant spans automatically. Adopt the standard; don’t reinvent it.
- “Capturing full message content in traces is fine for debugging.” It is fine in development. In production, it creates significant data privacy and compliance risk. Apply the three-layer redaction strategy before going to production.
Frequently Asked Questions
What exactly are OpenTelemetry GenAI Semantic Conventions? They are a set of standardized attribute names and span structures defined by the OpenTelemetry GenAI Special Interest Group for instrumenting AI/LLM workloads. They give every span a consistent vocabulary: the same attribute key means the same thing in Datadog, Honeycomb, Grafana, or any other OTel-compatible backend. Development began in April 2024, and the client span conventions are still moving toward a stable release.
Which frameworks emit OTel-compliant spans automatically? As of 2026, LangChain, CrewAI, and the OpenAI Agents SDK all emit spans that conform to the OTel GenAI conventions. You get structured, queryable traces with minimal manual instrumentation.
What should I put in a handoff span? At minimum: the sending agent’s identity, the receiving agent’s identity, and a snapshot of the state being passed. If the state is large, a content hash or a structured summary is acceptable. The goal is being able to answer “what exactly did agent A give to agent B?” when debugging a failure.
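A minimal sketch of that handoff record, assuming the state is JSON-serializable: small states are stored as a snapshot, large ones as a content hash plus a key summary. The attribute names (`handoff.sender`, `handoff.state.sha256`, and so on), the 1 KB inline threshold, and the agent names are hypothetical choices for this example, not part of the OTel conventions.

```python
import hashlib
import json


def handoff_span(sender: str, receiver: str, state: dict, max_inline_bytes: int = 1024) -> dict:
    """Build span attributes describing what one agent handed to another."""
    # Canonical JSON (sorted keys) so equal states always hash the same.
    payload = json.dumps(state, sort_keys=True, separators=(",", ":")).encode("utf-8")
    span = {
        "name": f"agent.handoff: {receiver}",
        "handoff.sender": sender,
        "handoff.receiver": receiver,
        "handoff.state.sha256": hashlib.sha256(payload).hexdigest(),
        "handoff.state.bytes": len(payload),
    }
    if len(payload) <= max_inline_bytes:
        span["handoff.state.snapshot"] = payload.decode("utf-8")  # small: keep it whole
    else:
        span["handoff.state.keys"] = sorted(state)  # large: structured summary only
    return span


sent = handoff_span("planner", "sub_agent_A", {"ticket_id": 42, "draft": "Hello"})
received = handoff_span("planner", "sub_agent_A", {"draft": "Hello", "ticket_id": 42})
stale = handoff_span("planner", "sub_agent_A", {"ticket_id": 41, "draft": "Hello"})
print(sent["handoff.state.sha256"] == received["handoff.state.sha256"])
print(sent["handoff.state.sha256"] == stale["handoff.state.sha256"])
```

Recording the same hash on both sides of the handoff turns "what exactly did agent A give to agent B?" into a comparison: matching hashes mean the state arrived intact, and a mismatch points at the handoff itself.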
Do I need to instrument every agent run at 100% trace depth? For errors, yes — always capture the full trace. For successful runs, sampling is fine. The exception is handoff spans and tool I/O spans: these should be captured in full even in sampled runs, because they carry the diagnostic signal you need when something goes wrong.
Where This Fits in the Series
This is episode 5 of the Agents at Scale: The 2026 Frontier series inside the How Claude Actually Works course. The previous episodes covered agent architectures, memory, tool use, and multi-agent orchestration. This episode establishes the observability foundation — what you need to see inside a running system before you can trust it in production. The next episode covers harness engineering: how to structure the scaffolding that wraps and coordinates agent runs.