When the Trace Stops Being Enough: Observability for Multi-Agent Systems

By Ofer Mendelevitch

Jun 3, 2026 · 6 min read

Single-agent traces tell you what each agent did. Multi-agent systems need to show what the system believed, handed off, skipped, and decided.

It's 2:47am. Your pager fires.

Here's what actually happened.

2:15am Feature flag changes checkout routing
2:31am Unrelated deploy ships
2:47am Error rate hits 6%
2:48am Triage agent blames deploy
2:51am Diagnostic agent finds DB symptoms (connection pool exhausted)
2:55am Remediation agent rolls back deploy
2:57am Comms agent posts "resolved"
2:58am Error rate still 6%

Illustrated incident timeline showing a feature flag change, deploy, error spike, rollback, and status update connected by competing signal paths.

Here’s what your agents saw:

The triage agent looked at recent changes, saw the 2:31am deploy, and handed off to the diagnostic agent: “recent deploy, likely cause.”
The diagnostic agent went looking for evidence that the deploy broke something. It found connection pool exhaustion, which often shows up around deploys, and handed off: “DB issue from the deploy, medium confidence.”
The remediation agent rolled back the deploy. Somewhere between agents, “medium confidence” had become “confirmed.”
The comms agent was set to fire when remediation completed, not when the error rate recovered. It posted “Identified and resolved” to the status page.

The error rate is still 6%. The feature flag is still live. Customers are still failing checkout. You're paged in, half-awake, staring at a status page that says everything is fine.

Where do you start looking?

If your honest answer is "grep across four agents' logs and pray the timestamps line up," you're not alone. The reason isn't that you skipped a best practice. The observability stack you have today was built to solve a different problem.

Observability for single agents

If you've shipped an AI agent in the last two years, you know the drill. You wired up Arize Phoenix, Braintrust, LangSmith, or a homegrown OpenTelemetry pipeline. You get spans for every model call, attributes for the prompt and completion, nested children for tool invocations, and a roll-up of tokens, latency, and cost.

It works because the unit of work is well-defined. An LLM call is one node in a tree. One conversation is one tree. Trace propagation is structural: a parent span calls a tool, the tool's span hangs off the parent. The shape of the trace mirrors the shape of the execution.

Then someone wires two agents together. Then four (like the on-call workflow above). Then a dozen. Each one running in its own process, each calling its own tools, each treating the previous agent's conclusion as input.

You now have clean traces and no view of the workflow that connected them. The trace tree has no root - because the workflow isn't a tree.

Berkeley's MAST study annotated over 200 multi-agent traces across seven frameworks. The headline finding wasn't the failure rates on specific benchmarks - those numbers may or may not translate cleanly to production systems - but the shape of the failures: inter-agent misalignment, poor handoff fidelity, and coordination breakdowns showed up repeatedly as a major source of failure. These are failures where every individual span looks fine and the workflow still goes wrong.

Diagram comparing single-agent trace trees with fragmented multi-agent traces

Four assumptions that break across agent boundaries

Four assumptions that hold inside a single agent fail once multiple agents collaborate.

1. Timestamps are not causality

The first thing that breaks is causality. In a single-process trace, happened-before is usually easy to follow: one span calls another, parent-child relationships are explicit, and you can read the trace top-to-bottom with some confidence about what caused what.

In a multi-agent system, the timeline is much less trustworthy as evidence. Each agent sees only part of the system, each tool emits events from its own perspective, and the shared view is often stitched together after the fact. Even when the timestamps are accurate, they mostly tell you what happened nearby in time. They do not tell you what caused what.

Back to 2:47am. The triage agent flagged the deploy because the deploy timestamp landed shortly before the alert. That was a plausible lead, but it was not proof. The deploy, the first failing request, the alert, and the feature flag change were separate events from separate systems. The timeline suggested a story: deploy, then errors. But the real causal chain started earlier, when the feature flag changed checkout routing at 2:15am.

The problem was not that the clocks were wildly wrong. The problem was that every agent downstream inherited the first agent’s interpretation of the timeline. “Recent deploy” became “likely deploy.” “Likely deploy” became “DB issue from deploy.” By the time the remediation agent acted, a weak temporal correlation had hardened into an operational fact.
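To make the difference concrete, here is a minimal sketch of the two ways a triage step could read the timeline. It uses only the times from the incident above; the `ChangeEvent` type, the source names, the one-hour lookback, and both function names are illustrative assumptions, not part of any real tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    source: str        # hypothetical name of the system that recorded the change
    description: str
    at: datetime


def nearest_change(changes, alert_at):
    """The shortcut: pick the latest change before the alert."""
    earlier = [c for c in changes if c.at <= alert_at]
    return max(earlier, key=lambda c: c.at) if earlier else None


def candidate_causes(changes, alert_at, lookback=timedelta(hours=1)):
    """Keep every change in the lookback window, oldest first.

    Nothing is ranked by closeness to the alert: proximity is a lead, not proof.
    """
    window_start = alert_at - lookback
    in_window = [c for c in changes if window_start <= c.at <= alert_at]
    return sorted(in_window, key=lambda c: c.at)


day = datetime(2026, 1, 1)  # arbitrary date; only the clock times come from the timeline
changes = [
    ChangeEvent("deploy-pipeline", "unrelated deploy ships", day.replace(hour=2, minute=31)),
    ChangeEvent("flag-service", "feature flag changes checkout routing", day.replace(hour=2, minute=15)),
]
alert_at = day.replace(hour=2, minute=47)

shortcut = nearest_change(changes, alert_at)    # the deploy
suspects = candidate_causes(changes, alert_at)  # the flag change, then the deploy
```

The second function does not find the cause either. It just refuses to throw away the flag change before anyone has looked at it, which is the part the downstream agents never got a chance to do.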

2. Handoffs aren’t calls

OpenTelemetry's context propagation is structural. When service A calls service B, it stamps a traceparent header on the request. B reads it, opens a child span under A's trace ID, and the call graph becomes a trace tree automatically. The framework works because it was built for exactly this - well-defined calls, predictable propagation, a parent for every child.
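For readers who have not looked inside that header, the sketch below shows the mechanics in plain Python rather than through an OpenTelemetry SDK. It follows the W3C Trace Context layout (version, trace ID, parent span ID, flags) but skips parts of the spec, such as rejecting all-zero IDs. The function names and the dictionary shape are assumptions for illustration.

```python
import re
import secrets

# version - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Service A: start a trace and stamp the outgoing request."""
    trace_id = secrets.token_hex(16)  # 16 random bytes, 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes, 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # flags 01 means "sampled"


def child_span(traceparent):
    """Service B: read the header and open a span under A's trace."""
    match = TRACEPARENT_RE.match(traceparent)
    if match is None:
        raise ValueError("malformed traceparent header")
    _version, trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,              # inherited, so both spans land in one tree
        "parent_span_id": parent_span_id,  # the caller's span becomes the parent
        "span_id": secrets.token_hex(8),   # fresh ID for B's own work
        "sampled": int(flags, 16) & 1 == 1,
    }


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
span = child_span(header)
```

Everything here depends on there being a request to stamp. That is the assumption the next few paragraphs take apart.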

With agent-to-agent collaboration, a handoff isn't a call - it's a transition. When the diagnostic agent finishes and the remediation agent takes over, the actual mechanics could be any of the following:

A message published to a queue that a different orchestrator picks up
A new context object materialized from saved state
A fresh LLM session started with the previous agent's output as input

There's no parent span in any meaningful sense. The relationship between the two agents' work lives in the meaning of what they exchanged (e.g. "I think this is a DB problem from the recent deploy, confidence medium"), not in the structure of how they exchanged it.

The OpenTelemetry community has noticed this gap. The gen_ai.invoke_agent and gen_ai.handoff semantic conventions let you tag agent identity and mark handoff boundaries explicitly.

Within a single framework - LangGraph traced by LangSmith, for instance - this works well. The break happens at the framework boundary. The semantic conventions help standardize the vocabulary, but they do not guarantee consistent capture across independently built agents, different frameworks, queues, orchestration layers, or human-in-the-loop paths. In practice, the handoff metadata is the first thing to break - and it breaks exactly on the failure paths you most need it for.

This is what dropped “medium confidence” between the diagnostic and remediation agents in our example. There was nowhere structural for that qualifier to live.
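One way to picture a structural home for that qualifier is a typed handoff envelope. The sketch below is an assumption-heavy illustration: the `Handoff` fields, the `Confidence` scale, and the `may_remediate` guard are all hypothetical names, and the threshold is a policy choice rather than a recommendation.

```python
from dataclasses import dataclass, replace
from enum import IntEnum


class Confidence(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CONFIRMED = 4


@dataclass(frozen=True)
class Handoff:
    from_agent: str
    to_agent: str
    hypothesis: str
    confidence: Confidence
    evidence: tuple = ()


def forward(handoff, to_agent):
    """Pass a handoff along without touching the hypothesis or its confidence."""
    return replace(handoff, from_agent=handoff.to_agent, to_agent=to_agent)


def may_remediate(handoff, minimum=Confidence.HIGH):
    """Hypothetical guard: act only when the stated confidence clears the bar."""
    return handoff.confidence >= minimum


diagnosis = Handoff(
    from_agent="diagnostic",
    to_agent="remediation",
    hypothesis="DB issue from the deploy",
    confidence=Confidence.MEDIUM,
    evidence=("connection pool exhausted",),
)
allowed = may_remediate(diagnosis)   # False: medium does not clear HIGH
relayed = forward(diagnosis, "comms")
```

Because the envelope is frozen and `forward` copies the confidence field unchanged, the only way for "medium" to become "confirmed" is for some agent to construct a new handoff and say so explicitly, which is exactly the event you want recorded.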

Illustrated multi-agent handoff sequence where triage, diagnostic, remediation, and communications agents lose a recovery check despite medium confidence.

3. Traces capture work, not flow

Per-agent traces capture what each agent did locally: which prompts it ran, what tools it called, what it returned. They don't capture handoffs, shared state, why the next agent took over, or what the workflow as a whole was trying to accomplish.

For example, in the 2:47am incident, every agent did its local job correctly. The failure was in the interaction between them: no agent queried the config store, and no agent checked whether the rollback had fixed anything.

The agents were locally correct. The workflow was globally wrong.
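A workflow-level check is the kind of thing no single agent's trace would contain. As a rough sketch, the comms step could be gated on the outcome the workflow cares about instead of on the previous agent finishing. The function name, the status strings other than "Identified and resolved", and the 1% recovery threshold are assumptions made for this example.

```python
def status_update(remediation_done, error_rate, recovered_below=0.01):
    """Choose a status message from the workflow's goal, not from agent completion.

    error_rate is a fraction (0.06 means 6%). recovered_below is an assumed
    threshold; a real system would take it from the alert's own definition.
    """
    if not remediation_done:
        return "Investigating"
    if error_rate < recovered_below:
        return "Identified and resolved"
    return "Remediation applied, still monitoring"


# At 2:57am the rollback had completed but the error rate was still 6%.
message = status_update(remediation_done=True, error_rate=0.06)
```

With this gate the status page would have said "still monitoring" at 2:57am, and the mismatch between the rollback and the error rate would have been visible as a workflow event instead of a surprise.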

4. You see only the survivors

Replaying a run from its traces shows you the version that completed cleanly. Killed branches, timed-out retries, or human interventions that happened out-of-band - none of these generate clean trace trees.

Therefore, what you can use for debugging is what survived. The runs you'd learn most from - the ones that almost worked - aren't in the trace.

The human-in-the-loop case can be the most important example of this. When an on-call engineer jumps into a Slack thread, edits an agent's proposed action, and clicks approve, none of that lives in your trace backend. From the trace's perspective, the agent simply proposed the edited action, a version the agent never actually generated.
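Keeping the two versions apart does not take much structure. The record below is a hypothetical sketch, not a schema from any trace backend: it stores what the agent proposed separately from what was executed and who approved it. The field names and the example actions are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionRecord:
    proposed_by: str
    proposed_action: str
    approved_by: Optional[str] = None      # None until a human or policy approves
    executed_action: Optional[str] = None  # None until something actually runs

    @property
    def was_edited(self):
        """True when the executed action differs from what the agent proposed."""
        return (
            self.executed_action is not None
            and self.executed_action != self.proposed_action
        )


record = ActionRecord(
    proposed_by="remediation-agent",
    proposed_action="roll back deploy",
    approved_by="on-call engineer",
    executed_action="roll back deploy and disable checkout flag",  # invented edit
)
```

A replay built from records like this can show the agent's original proposal next to the edited one, so the human intervention stops being invisible.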

Move observability to the substrate

The core problem is that today’s capture contract lives in application code.

Every team instruments their agents, every framework emits its own spans, and every backend stitches the artifacts together into an inferred story. The shape of that story depends on whether every team remembered to propagate context, and whether every framework version still emits the right metadata. In practice, this breaks most often at the boundaries between frameworks and teams.

The alternative is to move the contract into the substrate the agents already use to communicate with each other. Seven event types cover what you need to capture:

Text - messages between agents, and between agents and humans
Thought - an agent's reasoning before it commits to an action
Tool Call - a function an agent invokes
Tool Result - what that function returned, including failures
Task - the intent or state change driving an agent's work
Error - a failure, an escalation, a fallback firing
Handoff - an agent transferring ownership to another

Six of the seven map onto existing OpenTelemetry gen_ai conventions. What changes is where they're captured - at the substrate, not in application code, which means they're reliably present for every workflow without per-team instrumentation discipline. Task is the one without a direct OpenTelemetry equivalent: an explicit representation of intent and state that downstream agents can read directly, so a confidence qualifier can't get rewritten between agents.
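As a rough picture of what such an event could look like on the wire, here is a minimal sketch. The seven type names come from the list above; the `WorkflowEvent` fields, the string values, and the payload keys are assumptions for illustration and do not describe any particular product's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EventType(Enum):
    TEXT = "text"
    THOUGHT = "thought"
    TOOL_CALL = "tool_call"
    TOOL_RESULT = "tool_result"
    TASK = "task"
    ERROR = "error"
    HANDOFF = "handoff"


@dataclass(frozen=True)
class WorkflowEvent:
    workflow_id: str       # ties events from different agents to one workflow
    agent: str
    event_type: EventType
    payload: dict          # free-form here; a real schema would be stricter
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# A Task event that states the hypothesis and its confidence explicitly,
# so the next agent reads the qualifier instead of inferring it from prose.
task = WorkflowEvent(
    workflow_id="incident-0247",  # hypothetical identifier
    agent="diagnostic",
    event_type=EventType.TASK,
    payload={"hypothesis": "DB issue from the deploy", "confidence": "medium"},
)
```

The detail that matters is the shared `workflow_id`: it gives events from separate processes a common root without requiring a parent span.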

Together, these seven are enough to build the workflow-level views most teams need:

Shared context. What each agent currently believes, what hypothesis it is pursuing, what confidence it has, and what state it handed to the next agent.

Coordination quality. Whether the system is converging or thrashing: handoff counts, hypothesis revisions, repeated work, dwell time per agent, and confidence decay across the workflow.

Recovery paths. Retries, fallbacks, escalations, abandoned branches, and human overrides captured with the same fidelity as the happy path.

Explainability. A narrative an incident manager or customer success lead can read: what happened, why, who decided what, and which signal caused the system to declare success.
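To show how a view like coordination quality could fall out of the event stream, here is a small sketch over a hand-written log of the 2:47am incident. The event dictionaries, the `new_evidence` flag, and both function names are assumptions; the confidence labels for the diagnostic and remediation agents come from the incident description.

```python
LEVELS = {"low": 1, "medium": 2, "high": 3, "confirmed": 4}

# Hand-written log: handoffs between agents, plus each agent's stated belief.
events = [
    {"type": "handoff", "from": "triage", "to": "diagnostic"},
    {"type": "task", "agent": "diagnostic", "confidence": "medium", "new_evidence": True},
    {"type": "handoff", "from": "diagnostic", "to": "remediation"},
    {"type": "task", "agent": "remediation", "confidence": "confirmed", "new_evidence": False},
    {"type": "handoff", "from": "remediation", "to": "comms"},
]


def handoff_count(events):
    """Number of ownership transfers in the workflow."""
    return sum(1 for e in events if e["type"] == "handoff")


def unsupported_escalations(events):
    """Agents whose confidence rose above the previous belief with no new evidence."""
    beliefs = [e for e in events if e["type"] == "task"]
    flagged = []
    for prev, curr in zip(beliefs, beliefs[1:]):
        rose = LEVELS[curr["confidence"]] > LEVELS[prev["confidence"]]
        if rose and not curr["new_evidence"]:
            flagged.append(curr["agent"])
    return flagged


count = handoff_count(events)
suspicious = unsupported_escalations(events)
```

On this log the check flags the remediation agent, which is the point where "medium" turned into "confirmed" without anyone adding evidence.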

The responsibility moves into the substrate

Moving observability into the agent communication substrate does not make the hard parts disappear. It relocates them to a single layer, where they can be solved once.

Instead of asking every agent, every framework, and every team to preserve enough context for someone else to reconstruct the workflow later, you make the communication layer responsible for capturing the workflow as it happens. That is a better place to put the contract, but it also means the substrate becomes part of the system’s operational surface area.

That has consequences. The substrate has to be reliable. It has to add little overhead, so latency stays within SLA bounds. It has to export data into the tools teams already use. And it cannot become a black box that traps the history of the system inside one vendor’s backend.

Those are real requirements, but they are also familiar ones. We already make queues, databases, service meshes, and observability pipelines production-grade because important systems depend on them. The same standard applies here: durable delivery, clear failure modes, portable event export, and enough openness that the communication layer can be inspected or replaced.

The important point is this: we put the capture contract in one place, where it can be designed, tested, and operated deliberately, instead of being rediscovered at every boundary between agents, frameworks, and teams.

A new unit of observability

For a single agent, the trace is usually enough. Once agents start handing work to each other, the thing you need to observe is the workflow itself.

Framework-native tools can go deep inside one framework, but the hardest failures happen at the boundaries between frameworks and between teams. You can try to solve that by instrumenting every SDK perfectly and auditing every handoff. Or you can move the capture point to the place where collaboration already happens: the wire between agents.

BAND is designed to be that wire.

It is a collaboration substrate for agents from different frameworks, models, and teams - the way a message bus coordinates services, but with workflow-level events built in. Messages, handoffs, tool calls, escalations, human approvals, and task state are captured because they are part of the communication path, not because every team remembered to emit the perfect span.

To see how BAND helps teams build cross-framework multi-agent systems with workflow-level observability, try it at band.ai or contact us for a demo.
