The Six Hard Problems of Multi-Agent Production

By Vlad Luzin

May 26, 2026 · 7 min read

A shared multi-agent system descending into visible production chaos across routing, crashes, discovery, and coordination.

This is the bill-of-materials post for production multi-agent systems. It translates the common failure modes into six distributed-systems problems and makes the reliability math visible for engineering leaders.


In the last post I argued that multi-agent systems work better as conversations than as pipelines. Several people responded with some version of: "Okay, but what makes the conversation model actually work in production?"

Fair question. The honest answer is: a lot of infrastructure that doesn't exist in most deployments today. Whether you're building pipelines, graphs, hierarchies, or conversational spaces, the moment you move from a single-agent demo to multiple agents in production, six problems show up that nobody warned you about.

I know because I've watched teams hit every one of them.

1. Message routing - who processes what?

Two agents in the same system. A message arrives. Who handles it? Without explicit routing, both agents process it - generating duplicate responses, wasting tokens, confusing users. Or neither processes it, because each assumes the other will.

In single-agent systems this problem doesn't exist. There's one agent. It processes everything. The moment you add a second, you need a routing decision on every message. Most teams hardcode this: Agent A handles topic X, Agent B handles topic Y. That works until a message spans both topics, or until a third agent arrives and the routing logic needs rewriting.
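To make the routing decision concrete, here is a minimal sketch of explicit addressing, one possible alternative to hardcoded topic rules. Everything in it is hypothetical and invented for illustration: the `Message` and `Router` names, the `recipients` field, and the idea of a fallback "triage" agent. It is not taken from any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    # Hypothetical field: names of the agents this message is addressed to.
    recipients: list[str] = field(default_factory=list)


class Router:
    """Decides which registered agents should process a message."""

    def __init__(self, fallback: str):
        # Assumption of this sketch: one fallback agent takes any
        # message that is not addressed to a known agent.
        self.fallback = fallback
        self.agents: set[str] = {fallback}

    def register(self, name: str) -> None:
        self.agents.add(name)

    def route(self, msg: Message) -> list[str]:
        # Keep only known recipients, dropping duplicates but preserving order.
        targets = list(dict.fromkeys(r for r in msg.recipients if r in self.agents))
        # Exactly one agent handles unaddressed messages, so "both" and
        # "neither" cannot happen.
        return targets or [self.fallback]


router = Router(fallback="triage")
router.register("billing")
router.register("support")
```

With this shape, a message that spans two topics is addressed to both agents by name, and adding a third agent is a `register` call rather than a rewrite of the routing rules.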

2. Delivery tracking - did the agent actually process it?

You sent a message to an agent. Did it receive it? Did it start processing? Did it finish? Did it fail silently halfway through?

In most multi-agent setups, you don't know. The message was sent. Maybe something comes back. Maybe it doesn't. There's no per-agent, per-message status lifecycle - no equivalent of "delivered, read, processing, completed, failed." When an agent is slow, you can't distinguish between "still working" and "crashed without telling anyone." When a conversation stalls, debugging means reading logs across multiple agents and hoping you can reconstruct what happened.
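A per-agent, per-message lifecycle could look something like the sketch below. The five states come from the list above. The allowed transitions, the `DeliveryTracker` class, and the timeout-based `stalled` check are my assumptions for illustration, not an existing API.

```python
from enum import Enum


class Status(Enum):
    DELIVERED = "delivered"
    READ = "read"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition table: each state may move forward one step or fail.
ALLOWED = {
    Status.DELIVERED: {Status.READ, Status.FAILED},
    Status.READ: {Status.PROCESSING, Status.FAILED},
    Status.PROCESSING: {Status.COMPLETED, Status.FAILED},
    Status.COMPLETED: set(),
    Status.FAILED: set(),
}


class DeliveryTracker:
    def __init__(self):
        # (message_id, agent) -> (status, time of last update)
        self._state: dict[tuple[str, str], tuple[Status, float]] = {}

    def deliver(self, message_id: str, agent: str, now: float) -> None:
        self._state[(message_id, agent)] = (Status.DELIVERED, now)

    def advance(self, message_id: str, agent: str, new: Status, now: float) -> None:
        current, _ = self._state[(message_id, agent)]
        if new not in ALLOWED[current]:
            raise ValueError(f"illegal transition {current.value} -> {new.value}")
        self._state[(message_id, agent)] = (new, now)

    def status(self, message_id: str, agent: str) -> Status:
        return self._state[(message_id, agent)][0]

    def stalled(self, now: float, timeout: float) -> list[tuple[str, str]]:
        # Unfinished deliveries with no update for longer than `timeout`.
        terminal = {Status.COMPLETED, Status.FAILED}
        return [
            key
            for key, (status, updated) in self._state.items()
            if status not in terminal and now - updated > timeout
        ]
```

A timeout alone cannot prove an agent crashed, but it turns "maybe something comes back" into a list of specific message and agent pairs to investigate.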

3. Crash recovery - what happens to in-flight messages?

Agents crash. Networks drop. Processes restart. In a single-agent system, you lose one conversation and the user retries. In a multi-agent system, an agent crash mid-conversation means orphaned messages that no one will ever process. The conversation stalls permanently. Other agents that were waiting for the crashed agent's output are stuck. Manual intervention - someone reading logs and figuring out where things stopped - is the only recovery path.

LangGraph has checkpointing that enables crash recovery for a single agent's state. But that's one agent recovering its own state. In a multi-agent conversation, you need the system to know which messages were in flight, which agents had received them, and which need to be redelivered once the crashed agent comes back. That's a fundamentally different problem.
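As a rough sketch of what "the system knows what was in flight" might mean, consider a ledger that records every message sent to an agent until that agent acknowledges it. The `InFlightLedger` name and its methods are hypothetical, and a real version would need durable storage rather than an in-memory dict.

```python
class InFlightLedger:
    """Tracks messages that were sent to an agent but not yet acknowledged."""

    def __init__(self):
        # agent -> {message_id: payload}
        self._pending: dict[str, dict[str, str]] = {}

    def send(self, agent: str, message_id: str, payload: str) -> None:
        # Record the message before handing it to the agent.
        self._pending.setdefault(agent, {})[message_id] = payload

    def ack(self, agent: str, message_id: str) -> None:
        # The agent finished processing, so the message is no longer in flight.
        self._pending.get(agent, {}).pop(message_id, None)

    def redeliver(self, agent: str) -> list[tuple[str, str]]:
        # Called when a crashed agent comes back: everything it never acked.
        return list(self._pending.get(agent, {}).items())
```

This gives at-least-once delivery: an agent that crashed after doing the work but before acknowledging it will see the message again, so handlers in this model need to tolerate duplicates.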

4. Loop prevention - how do you stop infinite cycles?

Agent A produces output. Agent B reads it and responds. Agent A reads the response and produces new output. Agent B reads that and responds. Neither agent knows it should stop. The loop runs until a timeout kills it or your LLM budget runs out - whichever comes first.

This isn't hypothetical. Any system where agents can trigger each other is vulnerable. The more autonomous the agents, the more likely the loop. Most teams discover this the expensive way - a runaway loop that burns through thousands of API calls in minutes. The fix is usually a blunt instrument: global message limits, hard timeouts, or removing the ability for agents to address each other at all. None of these are great. What you actually need is an architectural guarantee that agents only process messages explicitly directed at them, combined with per-conversation limits - so every loop is bounded by design, not just caught after the damage.
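Here is a minimal sketch of those two rules working together. The `ConversationGuard` class, its `max_turns` budget, and the conversation IDs are assumptions made up for the example.

```python
class ConversationGuard:
    """Allows processing only for addressed agents, within a per-conversation budget."""

    def __init__(self, max_turns: int):
        self.max_turns = max_turns
        self.turns: dict[str, int] = {}  # conversation_id -> turns used so far

    def should_process(self, conversation_id: str, agent: str, recipients: list[str]) -> bool:
        # Rule 1: an agent ignores anything not explicitly addressed to it.
        if agent not in recipients:
            return False
        # Rule 2: each conversation has a hard cap on processed turns.
        used = self.turns.get(conversation_id, 0)
        if used >= self.max_turns:
            return False
        self.turns[conversation_id] = used + 1
        return True


def ping_pong(guard: ConversationGuard, conversation_id: str) -> int:
    """Simulates two agents that always reply to each other; returns turns processed."""
    receiver, other = "B", "A"
    processed = 0
    while guard.should_process(conversation_id, receiver, [receiver]):
        processed += 1
        receiver, other = other, receiver  # the reply is addressed back to the sender
    return processed
```

In this sketch, two agents that would otherwise reply to each other forever stop after exactly `max_turns` processed messages, and an agent that merely overhears a message never spends any of the budget.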

5. Discovery across boundaries - how do agents find each other?

Your data analysis agent runs on AWS. A partner's compliance agent runs on their infrastructure. A vendor provides a specialized financial agent as a service. A team in another business unit built an agent you didn't know existed that does exactly what you need.

How do any of these agents find each other?

A2A has Agent Cards, but no standard registry to discover them. MCP has no discovery mechanism at all. Inside a single framework, agents can be configured to know about each other at deploy time. Across frameworks, across cloud providers, across organizational boundaries - there's nothing.

Today, agent discovery means someone manually configures endpoint URLs, API keys, and capability descriptions for every agent pair that needs to interact. Add a new agent and you're updating configuration across every system that might need it. Remove an agent and hope nobody's still pointing at the old endpoint. This is the DNS problem of the agent world - except DNS was solved in 1983 and agent discovery is still manual in 2026.
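For contrast with per-pair configuration, here is a sketch of the lookup a shared registry would provide. The `AgentRecord` fields, the `Registry` class, and the example.com endpoints are all hypothetical. This is not the A2A Agent Card schema, and it ignores the hard parts such as authentication and trust across organizations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRecord:
    name: str
    endpoint: str  # hypothetical URL where the agent can be reached
    capabilities: frozenset[str]


class Registry:
    """One place to register agents and look them up by capability."""

    def __init__(self):
        self._records: dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        self._records[record.name] = record

    def deregister(self, name: str) -> None:
        # Removing an agent here means lookups stop returning its endpoint.
        self._records.pop(name, None)

    def find(self, capability: str) -> list[AgentRecord]:
        matches = (r for r in self._records.values() if capability in r.capabilities)
        return sorted(matches, key=lambda r: r.name)


registry = Registry()
registry.register(AgentRecord("compliance", "https://agents.example.com/compliance",
                              frozenset({"kyc", "audit"})))
registry.register(AgentRecord("finance", "https://agents.example.com/finance",
                              frozenset({"forecast", "audit"})))
```

Callers ask for a capability instead of holding an endpoint, so adding or removing an agent is one registry update rather than a configuration change in every system that might talk to it.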

6. Framework heterogeneity - N agents, N frameworks.

This one I covered in The Agent Framework Trap. Different agents built on different frameworks with different message formats, state models, and tool calling conventions. Every pair of frameworks that needs to interoperate requires custom translation code. The more frameworks in your system, the more bridges you're maintaining.
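A quick way to see how the bridge count grows, assuming the worst case where every pair of frameworks has to interoperate and each pair needs one bridge (the function name is mine):

```python
import math


def pairwise_bridges(frameworks: int) -> int:
    """Number of distinct framework pairs, i.e. bridges if every pair must interoperate."""
    return math.comb(frameworks, 2)


# Bridge counts for 2, 3, 5, and 10 frameworks.
counts = {n: pairwise_bridges(n) for n in (2, 3, 5, 10)}
```

Under that assumption, two frameworks need 1 bridge, three need 3, five need 10, and ten need 45. If translation code has to be written separately for each direction, those numbers double.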

Six compact scenes showing duplicate routing, no delivery receipt, crashed agent, infinite loop, missing registry, and framework translation sprawl.

What's notable about these six problems is that none of them are LLM problems. They're distributed systems problems. Message routing, delivery guarantees, crash recovery, deduplication, loop prevention - this is the same class of challenges that the microservices world spent a decade building infrastructure for. Message queues, service meshes, circuit breakers, distributed tracing.

The multi-agent world hasn't built its equivalent yet. Each team either ignores these problems (and hopes the demo-to-production gap doesn't bite them), or solves them ad hoc in application code (and maintains custom infrastructure alongside their actual product).

The reliability math is unforgiving. If each agent succeeds 95% of the time and failures are independent, two agents in sequence give you about 90%. Five agents give you about 77%. The system's reliability isn't the best agent's - it's the product of all of them. Without infrastructure that handles routing, delivery, recovery, and discovery, every agent you add makes the system less reliable, not more.
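The arithmetic is easy to check. This sketch assumes a strictly sequential chain where every agent must succeed and where failures are independent of each other, which is a simplification. The function name is mine.

```python
import math


def chain_reliability(success_rates: list[float]) -> float:
    """Probability that every agent in a sequential chain succeeds,
    assuming independent failures."""
    return math.prod(success_rates)


two_agents = chain_reliability([0.95] * 2)
five_agents = chain_reliability([0.95] * 5)
ten_agents = chain_reliability([0.95] * 10)
```

Two agents come out at roughly 0.9025 and five at roughly 0.7738, matching the figures above. At ten agents the chain is below 0.60.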

A visual showing strong single-agent reliability degrading into weak multi-agent system reliability as more agents are chained together.

These six problems are what any multi-agent infrastructure layer - whether it's a conversational model or a pipeline - has to solve. They're the bill of materials. In the next posts, I'll get into what solving them looks like, and what it means for how we think about agent governance.