What is a multi-agent system?
For 80% of businesses, one AI agent is enough. For the other 20%, one agent becomes a 2,000-line prompt that breaks every Friday and that nobody can debug. That is when you need a multi-agent system. Here is what it actually is, when you cross the threshold, and the stack to build one.
Definition: what a multi-agent system actually is
A multi-agent system is two or more LLM-powered agents that hand work between each other to complete a job no single agent does well alone. Concretely:
- Orchestrator agent — receives the user's request, decides which specialist to route to, holds the overall goal in memory.
- Specialist agents — narrow scope, deep expertise. A sales-qualifier agent only qualifies leads. A research agent only searches and summarizes. A scheduler agent only books calendars.
- Shared memory layer — usually Postgres or a vector DB. Lets agents pass state without re-explaining context every handoff.
- Tool registry — the catalog of functions each agent can call. Specialists have small, focused tool lists.
Think of it like an organization: a manager (orchestrator) decides what gets done, and specialists (analyst, salesperson, support) do the focused work.
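The four parts above can be sketched in a few lines of Python. This is a minimal illustration, not production code: the agent names, tool names, and keyword routing are all hypothetical, the in-memory `SharedMemory` stands in for Postgres or a vector DB, and a real orchestrator would ask an LLM to route instead of matching keywords.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tool registry: each specialist sees only its own short tool list.
TOOL_REGISTRY: dict[str, list[str]] = {
    "qualifier": ["lookup_company", "score_fit"],
    "research": ["web_search", "summarize"],
    "scheduler": ["find_slots", "book_meeting"],
}


@dataclass
class SharedMemory:
    """In-memory stand-in for the Postgres or vector-DB layer."""
    goal: str
    results: dict[str, str] = field(default_factory=dict)


def make_specialist(name: str) -> Callable[[SharedMemory], None]:
    """Build a stub specialist; a real one would call an LLM with its tools."""
    def run(memory: SharedMemory) -> None:
        tools = ", ".join(TOOL_REGISTRY[name])
        memory.results[name] = f"{name} handled '{memory.goal}' using [{tools}]"
    return run


SPECIALISTS = {name: make_specialist(name) for name in TOOL_REGISTRY}


def route(request: str) -> str:
    """Keyword routing as a placeholder for an LLM routing decision."""
    text = request.lower()
    if "book" in text or "meeting" in text:
        return "scheduler"
    if "lead" in text or "qualify" in text:
        return "qualifier"
    return "research"


def orchestrate(request: str) -> SharedMemory:
    """Hold the goal in shared memory, pick one specialist, and run it."""
    memory = SharedMemory(goal=request)
    SPECIALISTS[route(request)](memory)
    return memory
```

Calling `orchestrate("Qualify this lead")` runs only the qualifier and leaves its result in shared memory, where the next agent can read it without the context being re-explained.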
When one agent is enough
Honest signals that you should NOT build multi-agent:
- One scenario, one goal. "Answer FAQ in Telegram." One agent, done.
- 8 tools or fewer. A single agent can juggle up to 8 tools without confusion. Above 15, selection accuracy drops.
- Linear flow. Each turn does one thing, no parallel work, no waiting on external events.
- Single domain. The agent talks about one product, one process, one type of customer.
90% of my clients ship a single-agent version first, even when they will eventually need multi-agent. It is faster, cheaper, and proves the business case.
When you actually need multi-agent
Five signals that one agent is no longer enough:
- Tool count crosses 15-20. One agent with 30 tools picks the wrong one ~25% of the time. Splitting into specialists, each with 5-8 tools, brings that error rate down to about 5%.
- Parallel work is required. "While the research agent is gathering competitor data, the writer agent drafts the intro." One agent does these in series — multi-agent does them at once.
- Different agents need different models. Reasoning done by Claude Opus, fast tool use by GPT-5 Mini, sensitive data handled by Hermes self-hosted. One process orchestrates all three.
- Domains are too different. A sales agent needs a friendly, closer-tone prompt. A compliance agent needs a strict, conservative prompt. Mix them in one prompt and neither works well.
- Long-running workflows. "Monitor inbox, draft reply, get human approval, send." Hours or days. Multi-agent with state persistence is the natural fit.
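The five signals can be written down as an explicit checklist. The sketch below is one way to do that. The `Workload` fields are hypothetical names, and the cut-off of 15 tools is the low end of the 15-20 range above. How many signals should tip the decision is left to you.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of what the system has to handle."""
    tool_count: int
    needs_parallel_work: bool = False
    needs_different_models: bool = False
    distinct_domains: int = 1
    runs_for_hours_or_days: bool = False


def multi_agent_signals(w: Workload) -> list[str]:
    """Return the signals from the list above that this workload triggers."""
    checks = [
        (w.tool_count > 15, "tool count above 15"),
        (w.needs_parallel_work, "parallel work required"),
        (w.needs_different_models, "different models per agent"),
        (w.distinct_domains > 1, "domains too different"),
        (w.runs_for_hours_or_days, "long-running workflow"),
    ]
    return [label for hit, label in checks if hit]
```

An FAQ bot with 6 tools returns an empty list. A pipeline with 30 tools and parallel work returns two signals.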
Real example: B2B sales pipeline
$23,500 / 10 weeks for a Warsaw SaaS client. The system:
- Inbound agent — receives lead from website or email, extracts company name, role, intent.
- Qualifier agent — pulls company data from Clearbit + LinkedIn, scores fit against ICP, decides SQL or nurture.
- Content agent — drafts personalized follow-up referencing the prospect's recent posts and the SaaS's strongest case study for their industry.
- Calendar agent — when prospect replies positive, looks up sales rep's availability, sends 3 slot options.
- Orchestrator — decides which agent runs next, escalates to human on edge cases.
Outcome: MQL→SQL conversion 3× higher, +$340,000 quarterly revenue. I could not have shipped this as one agent: the qualifier prompt alone is 900 tokens of ICP-specific rules.
Real example: research bot
Internal tool for a consulting firm. Input: "summarize the European market for X in the last 6 months". Output: a 4-page briefing with sources.
- Planner agent — splits the request into 8-12 sub-questions.
- 4× search agents — run in parallel, each takes 2-3 sub-questions, hits web + internal DB + paid research APIs.
- Synthesizer agent — merges results, removes duplicates, ranks sources by trustworthiness.
- Editor agent — writes the briefing in the firm's house style, inserts citations.
A single-agent version of this exists: it takes 25 minutes per brief and misses 30-40% of sources. The multi-agent version takes 4 minutes with near-complete coverage. Five agents, one orchestrator, one Postgres for shared state.
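The speed-up comes from a fan-out/fan-in pattern: sub-questions are spread across parallel search agents, then merged and de-duplicated. Here is a minimal sketch using `asyncio.gather`. The search agent is a stub, the short sleep stands in for real network calls, and all function names are hypothetical.

```python
import asyncio


async def search_agent(agent_id: int, questions: list[str]) -> list[str]:
    """Stub search agent; a real one would hit web, internal DB, and paid APIs."""
    await asyncio.sleep(0.01)  # simulated I/O latency
    return [f"source-for:{q}" for q in questions]


def split_round_robin(items: list[str], n: int) -> list[list[str]]:
    """Deal items into n batches so each agent gets a similar share."""
    return [items[i::n] for i in range(n)]


async def run_research(sub_questions: list[str], workers: int = 4) -> list[str]:
    """Fan out to parallel search agents, then merge their results."""
    batches = split_round_robin(sub_questions, workers)
    results = await asyncio.gather(
        *(search_agent(i, batch) for i, batch in enumerate(batches))
    )
    # Synthesizer step: flatten and drop duplicates, keeping first-seen order.
    merged = [source for batch in results for source in batch]
    return list(dict.fromkeys(merged))
```

With 10 sub-questions and 4 workers, each agent gets 2-3 questions, and the total wall-clock time is roughly that of the slowest agent, not the sum of all four.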
Real example: operations dashboard
Multi-agent reporting pipeline for a Berlin fintech. Every morning at 8 AM:
- Collector agent × 4 — pulls from 4 databases in parallel (transactions, support tickets, ops events, finance ledger).
- Anomaly agent — looks for outliers (refund spikes, latency, churn signals).
- Narrator agent — writes a 1-page Slack summary with anomalies surfaced.
- Router agent — decides who gets pinged for which anomaly (CFO for finance, CTO for latency, head of support for tickets).
Result: −40 hours/week of analyst work, anomalies caught 4 days earlier on average. Payback in 2.5 months on a $36,800 build.
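The anomaly and router steps can be surprisingly small. The sketch below flags a value that sits far from its recent history and maps the anomaly category to the person named above. The z-score rule, the threshold of 3.0, and the fallback recipient are assumptions for illustration, not the detection logic of the actual build.

```python
from statistics import mean, stdev

# Routing table taken from the example above; the fallback is hypothetical.
ROUTES = {"finance": "CFO", "latency": "CTO", "tickets": "head of support"}
FALLBACK = "ops on-call"


def is_anomaly(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's value if it is more than z_threshold std devs from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return today != history[0]
    return abs(today - mean(history)) / sigma > z_threshold


def route(category: str) -> str:
    """Decide who gets pinged for a given anomaly category."""
    return ROUTES.get(category, FALLBACK)
```

For example, `is_anomaly([10, 12, 11, 9, 10], 30)` is true while `is_anomaly([10, 12, 11, 9, 10], 11)` is false, and `route("latency")` returns `"CTO"`.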
The stack: how I actually build this
Orchestration layer
- LangGraph — my default in 2026. Explicit state-graph, deterministic transitions, replayable. Good for production. Python or TS.
- OpenAI Swarm — lighter, more declarative. Good for prototypes and OpenAI-only stacks. Less control over state.
- Custom (FSM + handoffs in raw code) — when I need tight control or to avoid framework dependency. About 30% of my production builds. More code, fewer surprises.
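To make the custom option concrete, here is a minimal finite-state machine with handoffs in plain Python. The step names are borrowed from the B2B sales example. The handlers are stubs, and the 50-employee ICP rule and the step budget are hypothetical. A real handler would call an LLM and its tools before returning the next step.

```python
from typing import Callable

Handler = Callable[[dict], str]  # takes shared state, returns the next step name


def inbound(state: dict) -> str:
    state["company"] = state["raw_lead"].get("company", "")
    # Edge case: no company name extracted, so escalate to a human.
    return "qualifier" if state["company"] else "human"


def qualifier(state: dict) -> str:
    # Hypothetical ICP rule: 50+ employees counts as sales-qualified.
    big_enough = state["raw_lead"].get("employees", 0) >= 50
    state["segment"] = "SQL" if big_enough else "nurture"
    return "content" if big_enough else "done"


def content(state: dict) -> str:
    state["draft"] = f"Hi {state['company']}, here is a relevant case study."
    return "done"


TRANSITIONS: dict[str, Handler] = {
    "inbound": inbound,
    "qualifier": qualifier,
    "content": content,
}
TERMINAL = {"done", "human"}


def run(state: dict, start: str = "inbound", max_steps: int = 10) -> list[str]:
    """Walk the state machine and return the path taken."""
    step, path = start, [start]
    while step not in TERMINAL:
        if len(path) > max_steps:
            raise RuntimeError("step budget exceeded; possible loop")
        step = TRANSITIONS[step](state)
        path.append(step)
    return path
```

Because every transition is an explicit return value, the path for any input can be replayed and asserted in a test, which is most of what "fewer surprises" means in practice.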
Memory layer
- Postgres — for structured state (current step, who owns the task, results so far).
- pgvector or Pinecone — for semantic memory (past conversations, embedded knowledge).
- Redis — for ephemeral state between agent turns, rate limits, locking.
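For the structured-state part, one table is often enough to start. The sketch below uses Python's built-in `sqlite3` purely as a runnable stand-in for Postgres. The table and column names are hypothetical, and on Postgres you would use a driver such as psycopg with `%s` placeholders and probably a JSONB column for `results`.

```python
import json
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS task_state (
    task_id      TEXT PRIMARY KEY,
    current_step TEXT NOT NULL,
    owner_agent  TEXT NOT NULL,
    results      TEXT NOT NULL DEFAULT '{}'
)
"""


def connect() -> sqlite3.Connection:
    """Open an in-memory database and create the state table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def save_state(conn: sqlite3.Connection, task_id: str, step: str,
               owner: str, results: dict) -> None:
    """Insert or update the single state row for a task."""
    conn.execute(
        "INSERT INTO task_state (task_id, current_step, owner_agent, results) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT(task_id) DO UPDATE SET "
        "current_step = excluded.current_step, "
        "owner_agent = excluded.owner_agent, "
        "results = excluded.results",
        (task_id, step, owner, json.dumps(results)),
    )
    conn.commit()


def load_state(conn: sqlite3.Connection, task_id: str) -> Optional[dict]:
    """Return the current state for a task, or None if it does not exist."""
    row = conn.execute(
        "SELECT current_step, owner_agent, results FROM task_state WHERE task_id = ?",
        (task_id,),
    ).fetchone()
    if row is None:
        return None
    step, owner, results = row
    return {"current_step": step, "owner_agent": owner, "results": json.loads(results)}
```

Each agent reads the row at the start of its turn and writes it back at the end, so a crashed or paused workflow can resume from the last saved step.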
Model layer
Almost always heterogeneous. Orchestrator on Claude Sonnet 4.5, latency-critical specialists on GPT-5 Mini, compliance specialist on Hermes self-hosted. See the model decision matrix →
Observability layer
- Langfuse — traces every turn across every agent. Without it, debugging multi-agent is hell.
- Helicone — alternative, cost-focused, less detailed traces.
- Sentry — for the application-level errors around the agents.
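To show what "traces every turn" means at the smallest scale, here is a standard-library sketch that records which agent ran, whether it succeeded, and how long it took. It is not the Langfuse or Helicone API. The `traced` decorator, the `TRACE` list, and the `qualify` function are hypothetical, and a hosted tool adds prompts, token counts, and nested spans on top of this idea.

```python
import functools
import time

TRACE: list[dict] = []  # in-memory trace log; a real setup ships this to a backend


def traced(agent_name: str):
    """Decorator that records one trace entry per agent call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "error"  # overwritten only if the call returns normally
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            finally:
                TRACE.append({
                    "agent": agent_name,
                    "status": status,
                    "seconds": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@traced("qualifier")
def qualify(lead: dict) -> str:
    """Stub specialist with a hypothetical 50-employee rule."""
    return "SQL" if lead.get("employees", 0) >= 50 else "nurture"
```

After a run, `TRACE` holds one entry per agent turn in call order, which is the minimum you need to see where a multi-agent run went wrong.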
Cost reality
- Build cost — $5,000-50,000+ depending on agent count and integrations. See full pricing →
- Token cost — 3-7× higher than single-agent because every handoff is extra context. Plan $200-1,500/mo for medium-volume systems.
- Maintenance — multi-agent is harder to debug. Budget 20-30% of build cost per year for ongoing care.
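The ranges above are easy to turn into a back-of-envelope estimate. The sketch below applies the 3-7× token multiplier and the 20-30% maintenance rule to numbers you supply. The function names are hypothetical, and the single-agent token spend is an input you have to measure yourself.

```python
def maintenance_budget(build_cost: float) -> tuple[float, float]:
    """Yearly maintenance range: 20-30% of build cost."""
    return (build_cost * 20 / 100, build_cost * 30 / 100)


def token_cost_range(single_agent_monthly: float) -> tuple[float, float]:
    """Monthly token cost range: 3-7x the single-agent spend."""
    return (single_agent_monthly * 3, single_agent_monthly * 7)


def annual_running_cost(build_cost: float, single_agent_monthly: float) -> tuple[float, float]:
    """Yearly maintenance plus twelve months of tokens, as (low, high)."""
    maint_low, maint_high = maintenance_budget(build_cost)
    tok_low, tok_high = token_cost_range(single_agent_monthly)
    return (maint_low + 12 * tok_low, maint_high + 12 * tok_high)
```

For the $36,800 dashboard build, `maintenance_budget(36800)` gives $7,360 to $11,040 per year. If a single-agent version would have spent a hypothetical $100/month on tokens, `annual_running_cost(36800, 100)` comes to $10,960 to $19,440 per year.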
Pitfalls I have learned the hard way
- Do not let agents talk to each other freely. Always go through the orchestrator. Free chatter between agents causes infinite loops and runaway token bills.
- Specialist agents need narrow tool lists. Give each specialist only the tools it needs. Sharing tools across agents kills the accuracy gain.
- State must be explicit. Implicit "the agents will figure it out" never works. Define every handoff payload.
- Eval each agent independently. Then eval the whole system. Two pass rates: per-agent and end-to-end.
- Start with 2 agents, not 6. Most multi-agent systems I see in the wild have 3-4 too many agents. Each agent adds latency and a failure mode.
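The first and third pitfalls can be enforced in code. A minimal sketch, assuming a typed handoff payload and an orchestrator that rejects direct agent-to-agent messages and caps the number of hops. The field names, the `MAX_HOPS` value, and the class names are hypothetical.

```python
from dataclasses import dataclass

MAX_HOPS = 8  # hypothetical budget; tune it to your longest legitimate workflow


@dataclass(frozen=True)
class Handoff:
    """Explicit payload for one handoff; nothing is passed implicitly."""
    task_id: str
    from_agent: str
    to_agent: str
    payload: dict
    hop: int = 0


class Orchestrator:
    def __init__(self, agents: set[str]):
        self.agents = agents
        self.log: list[Handoff] = []

    def dispatch(self, handoff: Handoff) -> Handoff:
        """Validate a handoff before it is delivered, then record it."""
        # Every handoff must have the orchestrator on one side.
        if "orchestrator" not in (handoff.from_agent, handoff.to_agent):
            raise ValueError("agents must hand off through the orchestrator")
        if handoff.to_agent != "orchestrator" and handoff.to_agent not in self.agents:
            raise ValueError(f"unknown agent: {handoff.to_agent}")
        # A hop budget turns an infinite loop into a loud, cheap failure.
        if handoff.hop >= MAX_HOPS:
            raise RuntimeError("hop budget exceeded; possible loop")
        self.log.append(handoff)
        return handoff
```

A qualifier trying to message the content agent directly raises a `ValueError` instead of quietly starting a side conversation, and the log gives you every handoff payload to replay when evaluating each agent on its own.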
Should you build multi-agent?
Honest test: if a single agent with a 1,500-token prompt cannot do the job, you might need multi-agent. If you have not tried that yet, build the single agent first and measure where it fails.
I have shipped both. I am quick to recommend the simpler one. Book a call and I will tell you which side of the threshold you are on — usually within 30 minutes.