Five Claudes Are Not a Team

By Ofer Mendelevitch

May 11, 2026 · 7 min read

Split illustration: a robot facing its mirror reflection on the left, and two distinct robots looking through different windows on the right - echo chamber versus cross-model dialogue.

Codeband pairs Claude Code with Codex so the writer and the reviewer don't share a brain


Your coding agent can't review its own work

Ask a writer to proofread their own draft and they'll read what they meant to write, not what's on the page. Ask a coding agent to review its own diff and the same thing happens, for the same reason: the part of the mind that produced the work is too close to it to see what's missing.

This isn't a hunch. The LLM-as-judge literature documents it as self-preference bias: a model asked to judge work tends to favor output that looks like its own - its own style, its own phrasings, its own taste in abstractions. And the bias isn't bounded by instance. A Claude subagent reviewing Claude's diff carries the same blind spots. Spawning more copies of the same model gives you throughput, not a second opinion.

If you've shipped any volume of code with a coding agent - Claude Code, Codex, Cursor, it doesn't matter - you already know this. The agent is sharp on a contained change. Then you hand it something larger: a feature across three subsystems, a migration across dozens of files. It plans, codes, self-reviews, hands back something that reads as reasonable. Then you merge it, and two days later something breaks in a way that - when you go look - was visible in the diff the whole time.

Walk past any AI-assisted developer's desk and you'll see the workaround. Two windows open: Claude on one, Codex on the other. Plan in one, paste into the other for a sanity check. The pattern has even graduated from manual paste-buffer to tooling - there's now a popular Claude Code plugin that wraps Codex as a callable command, so the second opinion lives one slash-command away.

It's an open secret nobody talks about because it feels like cheating - we're supposed to have moved past this - but everyone who actually ships is doing some version of it.

The disagreement is the signal

Hand the same well-specified task to Claude and Codex and you'll get two implementations that look correct, pass the same tests, and disagree silently on a hundred decisions you never asked them to make.

One agent stores user input as-is; the other normalizes it on write. Both pass tests. Only one prevents a duplicate-account bug six months later.
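To make that fork concrete, here is a minimal Python sketch. The class names, the normalize_handle helper, and the choice of a username as the "user input" are illustrative assumptions, not code produced by either agent.

```python
def normalize_handle(raw: str) -> str:
    """Hypothetical normalization: trim whitespace and fold case."""
    return raw.strip().casefold()


class RawStore:
    """Stores user input exactly as typed."""

    def __init__(self) -> None:
        self._handles: set[str] = set()

    def register(self, handle: str) -> bool:
        # Rejects a handle only when the exact same string was seen before.
        if handle in self._handles:
            return False
        self._handles.add(handle)
        return True


class NormalizedStore:
    """Normalizes user input on write, so variants collapse to one key."""

    def __init__(self) -> None:
        self._handles: set[str] = set()

    def register(self, handle: str) -> bool:
        key = normalize_handle(handle)
        # Rejects any handle that normalizes to an existing key.
        if key in self._handles:
            return False
        self._handles.add(key)
        return True
```

A test suite that only registers distinct, clean handles passes against both classes. Register "Ana" and then " ana " and the two diverge: RawStore accepts both as separate accounts, NormalizedStore rejects the second.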

One handles partial failure in the payment module by retrying the whole transaction; the other splits the work into idempotent steps that recover independently. Both pass tests in development. Only one survives an 800ms network blip in production.
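The same contrast can be sketched in a few lines of Python. Everything here is hypothetical: FlakyGateway stands in for a payment provider, the simulated failure happens before the side effect lands, and the in-memory `done` set stands in for a durable record of completed steps. A real system would also need provider-side idempotency keys to cover failures that occur after the charge goes through.

```python
class FlakyGateway:
    """Hypothetical gateway whose Nth call fails once, like a network blip."""

    def __init__(self, fail_on_call: int) -> None:
        self.calls = 0
        self.fail_on_call = fail_on_call
        self.charges: list[tuple[str, int]] = []
        self.receipts: list[str] = []

    def _maybe_fail(self) -> None:
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise ConnectionError("simulated network blip")

    def charge(self, order_id: str, amount: int) -> None:
        self._maybe_fail()
        self.charges.append((order_id, amount))

    def send_receipt(self, order_id: str) -> None:
        self._maybe_fail()
        self.receipts.append(order_id)


def pay_retry_whole(gw: FlakyGateway, order_id: str, amount: int, attempts: int = 3) -> None:
    """Retries the whole transaction, repeating steps that already succeeded."""
    for _ in range(attempts):
        try:
            gw.charge(order_id, amount)
            gw.send_receipt(order_id)
            return
        except ConnectionError:
            continue
    raise RuntimeError("payment failed")


def pay_idempotent_steps(gw: FlakyGateway, order_id: str, amount: int,
                         done: set, attempts: int = 3) -> None:
    """Runs each step under its own key and skips steps already recorded in `done`."""
    steps = [
        ("charge", lambda: gw.charge(order_id, amount)),
        ("receipt", lambda: gw.send_receipt(order_id)),
    ]
    for name, action in steps:
        key = (order_id, name)
        if key in done:
            continue
        for _ in range(attempts):
            try:
                action()
                done.add(key)
                break
            except ConnectionError:
                continue
        else:
            raise RuntimeError(f"step {name} failed")
```

With a gateway that never fails, both functions charge once and send one receipt, which is why both pass tests in development. With FlakyGateway(fail_on_call=2), the receipt step fails once: the whole-transaction retry charges the customer a second time, while the stepwise version retries only the receipt.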

None of these is wrong: each is a fork in the road the agent walked past without telling you it had a choice - because to the agent, it wasn't a choice. It was just the obvious thing. Two instances of the same model walk past the same forks for the same reasons. The bugs that ship are the ones living in whichever blind spot the author and the reviewer share.

A different model family - different training distribution, different aesthetic priors, different taste - pauses at different forks. It flags things its counterpart waved through. It also waves through things its counterpart would flag. That asymmetry is the entire point.

This is also why Claude Code's existing tools don't close the gap. Native subagents are still Claude. The Codex-plugin pattern gets closer to the right idea - it even ships an adversarial-review command - but Claude is still the principal: it decides when to call Codex, what to show it, and what to do with the answer. Adversarial review needs both agents to have standing in the conversation, not one sitting in the other's tool tray.

Agent Teams - Claude Code's multi-session coordinator - is genuinely useful for splitting work that doesn't fit in one head: one teammate owns the frontend, another the backend, a third the tests. But every teammate is still Claude. Five Claude instances investigating a bug explore five hypotheses with the same priors, dismiss the same alternatives for the same reasons, and converge on the explanation Claude finds most plausible.

Two stylized code cards side by side - one labelled 'Raw Storage', the other 'Normalized Storage' - with a green 'both pass tests' badge underneath. Two valid implementations of the same task.

Introducing Codeband

Codeband is an open-source orchestrator that runs a band of coding agents on the same repository, coordinated through Band.ai's agent-to-agent communication platform. Its central idea isn't parallelism - Claude Code and Codex already parallelize internally - it's adversarial cross-model pairing:

  • A Claude planner decomposes the task; a Codex reviewer validates the decomposition (or vice versa).

  • A Claude coder writes the code; a Codex reviewer gates it (or vice versa).
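As a rough illustration of what "a writer gated by a reviewer from a different family" means, here is a small Python sketch. It is not Codeband's implementation or API: Agent, run_gated, the APPROVE token, and the stub agents are all hypothetical names, and real agents would call a model rather than return canned strings.

```python
from dataclasses import dataclass
from typing import Callable

APPROVE = "APPROVE"  # hypothetical verdict token


@dataclass(frozen=True)
class Agent:
    name: str
    family: str  # e.g. "claude" or "codex"
    act: Callable[[str], str]


def run_gated(writer: Agent, reviewer: Agent, task: str, max_rounds: int = 3):
    """Loops writer -> reviewer until the reviewer approves or rounds run out."""
    if writer.family == reviewer.family:
        # Same family means shared blind spots, so refuse the pairing.
        raise ValueError("writer and reviewer must come from different model families")
    prompt = task
    for round_no in range(1, max_rounds + 1):
        draft = writer.act(prompt)
        verdict = reviewer.act(draft)
        if verdict == APPROVE:
            return draft, round_no
        # Feed the reviewer's objection back into the next draft.
        prompt = f"{task}\nReviewer feedback: {verdict}"
    raise RuntimeError("reviewer did not approve within max_rounds")


# Stub agents standing in for real model calls.
def stub_writer(prompt: str) -> str:
    if "Reviewer feedback" in prompt:
        return "patch v2: adds input validation"
    return "patch v1"


def stub_reviewer(draft: str) -> str:
    return APPROVE if "input validation" in draft else "missing input validation"


coder = Agent("coder", "claude", stub_writer)
code_reviewer = Agent("code-reviewer", "codex", stub_reviewer)
```

Calling run_gated(coder, code_reviewer, "implement JWT auth") takes two rounds with these stubs: the first draft is rejected, the second is approved. Swapping the families gives the "or vice versa" arrangement; pairing two agents of the same family raises an error.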

Codeband isn't built for typo fixes; it's built for the PRs you'd already be pasting into a second agent by hand - the ones where a missed bug costs more than the extra API spend.

To get started, simply install it and use it as a CLI:

pip install codeband

cd my-project
cb init --repo https://github.com/myorg/myrepo.git
cb setup-agents
cb

Once cb is running, you'll land in a shell connected to a Band room where the coding agents are already idling. Dispatch a task and you'll watch the activity in the chatroom: the planner drafts a plan, the plan reviewer pushes back, the coder picks up the approved plan, and the code reviewer reviews the final PR.

From inside the shell you can do things like:

  • /task implement JWT auth with tests - dispatch work

  • /status - see where each task sits in the pipeline

  • /pending and /approve 42 - gate higher-risk merges before they land

A simplified chatroom view showing four agents in a thread: @planner (Claude) drops the plan, @plan-reviewer (Codex) pushes back, @coder (Claude) picks up plan, @code-reviewer (Codex) reviews PR.

Why chat, not a graph

Almost every other multi-agent orchestrator encodes the workflow in software: a DAG, a state machine, a hand-written control loop. When the work fits the graph, it works. When it doesn't - a reviewer wants to ping the planner mid-review with a clarifying question, two reviewers want to argue, a coder needs to escalate to a human - you're either editing the topology or routing around it.

Codeband doesn't define a graph. It runs on Band.ai, which gives every agent an identity and lets them all collaborate in a chatroom: agents @-mention each other, post into shared channels, and coordinate the way a human team does in Slack or Discord. The planner drops a plan in the channel; the plan reviewer replies in-thread; the coder watches and starts when the plan is approved; a human reads along and jumps in when they want to. Review is a dialogue, not a state transition.
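A toy Python model shows why this needs no topology. It is only a sketch of the idea, not Band.ai's API: Room, join, post, and the inbox structure are hypothetical, and real agents would react to their inbox rather than have messages posted on their behalf.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@([\w-]+)")


class Room:
    """Hypothetical chatroom: any member can address any other by @-mention."""

    def __init__(self) -> None:
        self.members: set[str] = set()
        self.log: list[tuple[str, str]] = []  # shared, readable by everyone
        self.inbox: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def join(self, name: str) -> None:
        self.members.add(name)

    def post(self, sender: str, text: str) -> None:
        # Every message lands in the shared log; mentions also notify members.
        self.log.append((sender, text))
        for name in MENTION.findall(text):
            if name in self.members and name != sender:
                self.inbox[name].append((sender, text))


room = Room()
for member in ("planner", "plan-reviewer", "coder", "human"):
    room.join(member)

room.post("planner", "Plan v1 is up. @plan-reviewer please review.")
# A mid-review clarifying question needs no new edge in a graph.
room.post("plan-reviewer", "@planner why a new table instead of a column?")
# A human can interrupt anyone at any point.
room.post("human", "@planner @coder hold off, the schema is frozen this week.")
```

Routing is decided by the message, not by a predeclared edge list, so the reviewer pinging the planner and the human interrupting the coder both work without changing any workflow definition.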

Codeband isn't the only tool reaching for cross-model review - Gastown (whose agent-interaction model directly inspired Codeband's) and a handful of other orchestrators share the premise that an agent reviewing its own output is grading its own homework. Where Codeband diverges: chat as the substrate, two adversarial pairs as the default. That's the smallest version of multi-agent coding that actually works.

Where this goes

Two adversarial pairs are a floor, not a ceiling. The substrate scales naturally to a security reviewer, a performance critic, a domain expert - different models with different priors, in the same room. Chat extends the way human teams extend: by adding specialists, not by editing the org chart.

The deeper shift is in stance.

Today's coding agents are tools you summon. The next ones are colleagues with standing in the conversation, with the right to say "I think this plan is wrong." Tool trays don't allow that. Chat does. Codeband is the smallest working version of that future.

Try it

If you've ever pasted a diff from one coding agent into another to catch what the first one missed, Codeband is that workflow made real. Please give it a try and send us your feedback. A few things we're especially curious about:

  • Languages beyond what we've shaken out. Most of our testing is Python and TypeScript. Rust, Go, Swift, Elixir, anything else - we want to hear what breaks.

  • The two-pair default. Are two adversarial pairs the right floor? Some work probably wants one pair, some three.

  • Human-in-the-loop feel. Does jumping into the channel feel natural, or do the agents talk over you?

Bugs go in Issues (with the smallest repro that triggers them, please). Design pushback - "this metaphor is wrong," "chat breaks for X" - goes in Discussions. PRs are welcome; see CONTRIBUTING.md for good starting points.

GitHub: github.com/thenvoi/codeband

Install: pip install codeband