Agent Observability: How to Trace and Debug AI Agents with OpenTelemetry

June 23, 2026 · Agents at Scale: The 2026 Frontier (part 5)


Every agent team hits the same wall eventually. The dashboard is green. Every LLM call returned 200. Every tool executed without error. Latency was fine. And the user is staring at output that is confidently, completely wrong.

Without observability, you guess what went wrong. With observability, you open a trace and read what actually happened. Those are not the same job — and in 2026, the tooling finally exists to do this properly.

The one-sentence version: Status codes tell you whether requests succeeded; traces tell you whether your agent did the right thing — and in agent systems, those two things are not the same.

The Core Problem: Agents Fail in Ways That Look Like Success

Traditional monitoring is built around a simple contract: if the request returned 200 and latency was within budget, everything is fine. That contract breaks completely in agentic systems.

A model can return a well-formed, grammatically perfect answer that is factually wrong. A tool can execute successfully with the wrong arguments. A handoff between agents can pass corrupted or stale state, and both sides will happily report success. The pipeline stays green while the output is garbage.

This is the fundamental observability gap in agent systems:

Success = the infrastructure did not error
Correctness = the agent did the right thing

Logs and status codes measure success. You need traces to measure correctness.

What to Actually Trace: Four Layers, One Standard

In 2026, the field converged on a vendor-neutral standard: OpenTelemetry GenAI Semantic Conventions, developed by the OTel GenAI Special Interest Group starting in April 2024. The client-side span conventions are stabilizing, and major observability platforms — Datadog, Honeycomb, New Relic, Grafana — support them natively. Major frameworks — LangChain, CrewAI, the OpenAI Agents SDK — emit OTel-compliant spans out of the box.

The payoff: the same span names, the same attribute vocabulary, every backend, every framework. One queryable tree per agent run.

A well-instrumented agent trace looks like this:

invoke_agent (root span, ~25s)
├── chat <model> (LLM call span)
├── execute_tool search_kb (tool span, includes args + return value)
├── execute_tool write_draft (tool span)
└── agent.handoff: sub_agent_A (handoff span, custom name)
    ├── chat <model> (sub-agent LLM call)
    └── execute_tool send_email (sub-agent tool span)

The four required layers are:

Layer	What it captures
LLM client	Model called, tokens used, latency, response
Agent invocation	Entry point, inputs, overall outcome
Tool execution	Tool name, arguments passed in, return value
Workflow / handoffs	State passed between agents, sub-agent trees

Four layers. One vocabulary. If you only instrument LLM calls, you will not be able to debug agent failures — because the bug almost never lives in the LLM call.
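To make the four layers concrete, here is a minimal, dependency-free sketch that models one agent run as a span tree and checks that every layer is represented. It deliberately does not use the OpenTelemetry SDK: the Span class, the layer labels, and the render and layers_present helpers are hypothetical stand-ins, and the span names and attribute values are illustrative only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a trace span (not the OpenTelemetry SDK type)."""
    name: str
    layer: str  # one of: "agent", "llm", "tool", "handoff"
    attributes: dict = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)

    def child(self, name: str, layer: str, **attributes) -> Span:
        """Create a child span, attach it to this span, and return it."""
        span = Span(name, layer, attributes)
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> list[str]:
    """Flatten the tree into indented lines, one per span."""
    lines = [f"{'  ' * depth}{span.name} [{span.layer}]"]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


def layers_present(span: Span) -> set[str]:
    """Collect every layer label that appears anywhere in the tree."""
    found = {span.layer}
    for c in span.children:
        found |= layers_present(c)
    return found


# Build a run shaped like the example trace (all values are made up).
root = Span("invoke_agent", "agent")
root.child("chat", "llm")
root.child("execute_tool search_kb", "tool", arguments={"query": "refund policy"})
root.child("execute_tool write_draft", "tool")
handoff = root.child("handoff sub_agent_A", "handoff", state_keys=["draft", "recipient"])
handoff.child("chat", "llm")
handoff.child("execute_tool send_email", "tool")

missing = {"agent", "llm", "tool", "handoff"} - layers_present(root)
print("\n".join(render(root)))
print("missing layers:", sorted(missing))
```

A check like layers_present can run in a test suite: if a framework upgrade silently drops tool or handoff spans, the missing set is no longer empty.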

The Rule That Separates Teams Who Debug from Teams Who Guess

Here is the insight that changes how you think about instrumentation:

The most informative span is almost never the LLM call.

LLM calls almost always succeed. The model returns something. The bug lives elsewhere:

In handoffs: one agent passed bad or corrupted state to another
In tool I/O: the tool received wrong arguments, or returned something unexpected that the agent misinterpreted
In state mutations: something changed in the shared context just before the wrong answer was produced

The mental model to hold: instrument the handoffs and tool I/O with as much detail as you instrument the LLM calls. More detail, actually — because that is where the truth lives.
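As a sketch of what detailed tool I/O instrumentation could look like, the decorator below records a tool's name, its arguments, and its return value (or error) for every call. It uses only the standard library. RECORDED_SPANS is a hypothetical in-memory sink standing in for a real exporter, and traced_tool, search_kb, and the dictionary keys are illustrative names, not part of any framework.

```python
import functools
import json

RECORDED_SPANS: list = []  # hypothetical in-memory sink standing in for an exporter


def traced_tool(func):
    """Record one span-like dict per call: tool name, arguments, result or error."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        span = {
            "name": f"execute_tool {func.__name__}",
            "arguments": json.dumps(kwargs, sort_keys=True, default=str),
        }
        try:
            result = func(**kwargs)
        except Exception as exc:
            span["error"] = repr(exc)  # failed calls are recorded too
            raise
        else:
            span["result"] = json.dumps(result, sort_keys=True, default=str)
            return result
        finally:
            RECORDED_SPANS.append(span)  # runs on both success and failure
    return wrapper


@traced_tool
def search_kb(query: str, limit: int = 3) -> list:
    # Stub tool: a real one would query a knowledge base.
    return [f"doc-{i} for {query!r}" for i in range(limit)]


search_kb(query="refund policy", limit=2)
print(RECORDED_SPANS[-1])
```

The point of recording arguments on success as well as failure is that a "tool succeeded with garbage input" bug is only visible if the input was captured.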

Sampling Strategy in Production

You cannot (and should not) keep 100% of every trace in production — storage costs and cardinality issues aside, you will drown in noise. The recommended philosophy:

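Keep 100% of error traces: never sample away a failure. This means deciding after the run completes (tail-based sampling), because you do not know at the start of a trace whether it will fail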
Sample successes — a representative subset is enough for performance analysis
Always trace handoffs, tool I/O, and state mutations in full — these are the high-signal spans; sampling them defeats the purpose
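The keep-or-drop decision above can be sketched as a small function that runs once a trace is complete. This is an illustration under assumptions, not a production sampler: keep_trace, the "status" key, and the 10% default rate are hypothetical, and spans are plain dictionaries rather than OpenTelemetry objects.

```python
import hashlib


def keep_trace(trace_id: str, spans: list, success_rate: float = 0.1) -> bool:
    """Decide whether to keep a finished trace.

    Every trace containing an error span is kept. Error-free traces are kept
    at roughly `success_rate`, chosen deterministically by hashing the trace
    ID so the same trace always gets the same decision.
    """
    if any(span.get("status") == "error" for span in spans):
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < success_rate


failed_run = [{"name": "execute_tool send_email", "status": "error"}]
clean_run = [{"name": "execute_tool send_email", "status": "ok"}]

print(keep_trace("trace-001", failed_run))  # error traces are always kept
print(keep_trace("trace-002", clean_run))   # kept only if its hash bucket is under the rate
```

Hashing the trace ID instead of calling a random number generator keeps the decision reproducible, which helps when several components must agree on whether a given trace was retained.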

Handling Sensitive Data: Three Layers of Protection

The OpenTelemetry GenAI conventions allow capturing message content (prompts and completions) in traces. The conventions treat content capture as opt-in, but instrumentation libraries differ in their defaults, so check yours. In production, you almost certainly do not want raw user data or model outputs sitting in your observability backend.

Three layers of protection, applied in order:

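Set capture_message_content = false at the SDK level (the exact setting name depends on your instrumentation; the official OpenTelemetry Python instrumentations read the OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT environment variable): this is the first gate; stop sensitive data from ever entering the span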
Add a redaction processor at the OTel collector — catches anything that slips through the SDK setting
Lock down access to the trace backend — role-based access, audit logging; assume the first two layers will each fail at least once

Each layer will fail at some point. Use all three.
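The first two layers can be illustrated with a small attribute scrubber. This is a sketch under assumptions: scrub, CONTENT_KEYS, and the attribute names are hypothetical, and the single email pattern stands in for whatever redaction rules your data actually needs. In a real deployment the second layer would usually live in the collector, not in application code, and the third layer (access control) is configuration on the backend, not code.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CONTENT_KEYS = {"prompt", "completion"}  # hypothetical attribute names


def scrub(attributes: dict, capture_message_content: bool = False) -> dict:
    """Return a copy of span attributes that is safer to export."""
    cleaned = {}
    for key, value in attributes.items():
        if key in CONTENT_KEYS and not capture_message_content:
            continue  # layer 1: message content never enters the span
        if isinstance(value, str):
            value = EMAIL.sub("[REDACTED_EMAIL]", value)  # layer 2: pattern redaction
        cleaned[key] = value
    return cleaned


raw = {
    "prompt": "Email ana@example.com about her refund",
    "tool.arguments": "send to ana@example.com",
    "tokens": 12,
}
print(scrub(raw))
```

Note that the redaction pass runs even when content capture is enabled, so a misconfigured flag does not automatically leak addresses.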

ASCII Overview: Where the Bug Usually Lives

 Agent Run
 │
 ├─ LLM Call ──────────── Almost always 200 OK. Rarely where the bug is.
 │
 ├─ Tool Execution ─────── Arguments matter. Return values matter.
 │    └─ wrong args ──────── Bug: "tool succeeded" but with garbage input
 │
 └─ Handoff to Sub-Agent ── State passed here. This is where truth lives.
      └─ corrupted state ─── Bug: both sides report success; output is wrong

Common Misconceptions

“A green dashboard means my agent is working correctly.” No. Green metrics mean requests completed without infrastructure errors. Correctness — whether the agent did the right thing — is not captured by status codes or latency.
“Instrumenting LLM calls is enough for agent observability.” It is a start, but the LLM call is the least informative span when debugging agent failures. Tool I/O and handoff spans are where the actual bugs surface.
“I need to build a custom observability layer for agents.” Not in 2026. The OTel GenAI semantic conventions are the standard, and the major frameworks emit compliant spans automatically. Adopt the standard; don’t reinvent it.
“Capturing full message content in traces is fine for debugging.” It is fine in development. In production, it creates significant data privacy and compliance risk. Apply the three-layer redaction strategy before going to production.

Frequently Asked Questions

What exactly are OpenTelemetry GenAI Semantic Conventions? They are a set of standardized attribute names and span structures defined by the OpenTelemetry GenAI Special Interest Group for instrumenting AI/LLM workloads. They give every span a consistent vocabulary: the same attribute key means the same thing in Datadog, Honeycomb, Grafana, or any other OTel-compatible backend. Development began in April 2024, and the client-span conventions are moving toward a stable release.

Which frameworks emit OTel-compliant spans automatically? As of 2026, LangChain, CrewAI, and the OpenAI Agents SDK all emit spans that conform to the OTel GenAI conventions. You get structured, queryable traces with minimal manual instrumentation.

What should I put in a handoff span? At minimum: the sending agent’s identity, the receiving agent’s identity, and a snapshot of the state being passed. If the state is large, a content hash or a structured summary is acceptable. The goal is being able to answer “what exactly did agent A give to agent B?” when debugging a failure.
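One way to build those handoff attributes is sketched below. The function name, the handoff.* attribute keys, and the 2048-byte inline limit are all assumptions made for illustration; they are not taken from a published convention.

```python
import hashlib
import json


def handoff_attributes(sender: str, receiver: str, state: dict,
                       max_inline_bytes: int = 2048) -> dict:
    """Build attributes for a handoff span (attribute keys are hypothetical)."""
    # Serialize with sorted keys so equal states always produce the same hash.
    payload = json.dumps(state, sort_keys=True, default=str).encode("utf-8")
    attrs = {
        "handoff.from_agent": sender,
        "handoff.to_agent": receiver,
        "handoff.state_sha256": hashlib.sha256(payload).hexdigest(),
        "handoff.state_bytes": len(payload),
    }
    if len(payload) <= max_inline_bytes:
        attrs["handoff.state"] = payload.decode("utf-8")  # small: store the full snapshot
    else:
        attrs["handoff.state_keys"] = sorted(state)  # large: store a structural summary
    return attrs


attrs = handoff_attributes(
    "planner", "sub_agent_A", {"recipient": "customer-42", "draft": "Hello"}
)
print(attrs["handoff.state_sha256"][:12], attrs["handoff.state_bytes"])
```

Recording the same hash on the receiving side gives a cheap check: if the two hashes differ, the state changed in transit.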

Do I need to instrument every agent run at 100% trace depth? For errors, yes — always capture the full trace. For successful runs, sampling is fine. The exception is handoff spans and tool I/O spans: these should be captured in full even in sampled runs, because they carry the diagnostic signal you need when something goes wrong.

Where This Fits in the Series

This is episode 5 of the Agents at Scale: The 2026 Frontier series inside the How Claude Actually Works course. The previous episodes covered agent architectures, memory, tool use, and multi-agent orchestration. This episode establishes the observability foundation — what you need to see inside a running system before you can trust it in production. The next episode covers harness engineering: how to structure the scaffolding that wraps and coordinates agent runs.

