MH (Frank) Tsai
AI Solutions Architect
Designing a Production Multi-Agent Architecture
When we first shipped our AI assistant, it was a single LLM call with a long system prompt. It worked — for about two weeks. Then the feature requests started piling up: handle restaurant search, process bookings, answer FAQs, manage customer support escalations. One agent trying to do everything meant it did nothing well.
This post documents how we redesigned the system into a hierarchical multi-agent architecture, the decisions that shaped it, and what we'd do differently.
The Problem with Monolithic Agents
A single-agent approach seems elegant until you hit these walls:
- Context window bloat. Every capability adds instructions, examples, and tool definitions. The system prompt grew past 8,000 tokens before we ran out of room for actual conversation context.
- Conflicting behaviors. The search agent needs to be exploratory and creative. The booking agent needs to be precise and transactional. One temperature setting can't serve both.
- Evaluation becomes impossible. When one prompt handles five jobs, you can't tell whether a regression in search quality was caused by a booking prompt change.
The Architecture: Hierarchical Routing
We settled on a supervisor-worker pattern:
```
                ┌─────────────┐
   User ──────> │  Supervisor │
                │  (Router)   │
                └──────┬──────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
 ┌──────▼──────┐  ┌────▼────┐  ┌──────▼──────┐
 │ Search Agent│  │ Booking │  │  CS Agent   │
 │             │  │  Agent  │  │             │
 └──────┬──────┘  └─────────┘  └──────┬──────┘
        │                             │
 ┌──────▼──────┐               ┌──────▼──────┐
 │  FAQ Agent  │               │ Escalation  │
 └─────────────┘               └─────────────┘
```
The Supervisor classifies user intent and delegates to the right specialist. It doesn't answer questions — it routes them. Its prompt is short and focused: "Given this conversation, which agent should handle the next turn?" It outputs a structured routing decision with the target agent and a brief context summary.
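A routing decision like this is easy to represent as a small typed object. Here's a minimal sketch in Python — the field names (`target_agent`, `context_summary`, `confidence`) are illustrative assumptions, not our exact schema:

```python
import json
from dataclasses import dataclass


@dataclass
class RoutingDecision:
    """Structured output from the supervisor's routing step."""
    target_agent: str      # e.g. "search", "booking", "cs" — names are hypothetical
    context_summary: str   # brief summary handed to the specialist agent
    confidence: float      # used later for the clarification threshold


def parse_routing_decision(raw: str) -> RoutingDecision:
    """Parse the supervisor's JSON response into a typed decision."""
    data = json.loads(raw)
    return RoutingDecision(
        target_agent=data["target_agent"],
        context_summary=data["context_summary"],
        confidence=float(data["confidence"]),
    )


decision = parse_routing_decision(
    '{"target_agent": "search", '
    '"context_summary": "user wants sushi nearby", '
    '"confidence": 0.92}'
)
```

Parsing into a typed object up front means a malformed routing response fails loudly at the boundary instead of propagating free-text into the specialists.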
Specialized Agents each have their own system prompt, tool set, temperature, and model. The search agent uses a higher temperature and has access to Elasticsearch tools. The booking agent runs at temperature 0 with transactional tools. The CS agent has access to order history and escalation protocols.
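In practice this per-agent tuning reduces to a small config per specialist. A sketch of what that registry might look like — the model names, prompts, and tool names below are placeholders, not our production values:

```python
from dataclasses import dataclass, field


@dataclass
class AgentConfig:
    """Per-agent settings: each specialist is tuned independently."""
    system_prompt: str
    temperature: float
    model: str
    tools: list[str] = field(default_factory=list)


# Hypothetical values — the point is the shape, not the numbers.
AGENTS = {
    "search": AgentConfig(
        system_prompt="You help users explore restaurants...",
        temperature=0.8,                   # exploratory, creative
        model="large-creative-model",
        tools=["elasticsearch_query"],
    ),
    "booking": AgentConfig(
        system_prompt="You execute bookings precisely...",
        temperature=0.0,                   # transactional, deterministic
        model="small-deterministic-model",
        tools=["create_booking", "check_availability"],
    ),
}
```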
Why Not a Flat Router?
We considered a flat approach where the router picks from all agents equally. The problem was multi-turn coherence. A user might start searching, find a restaurant, then want to book it. The flat router would re-classify each turn independently, losing the thread.
The hierarchical design solves this with session affinity. Once the supervisor delegates to an agent, that agent owns the conversation until it either completes its task or explicitly hands back control. The supervisor only re-routes on explicit topic changes or agent timeout.
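The affinity rule fits in a few lines of routing logic. A simplified sketch, assuming a `state` dict with `current_agent` and `handed_back` keys (our real state object carries more):

```python
def next_agent(state: dict, topic_changed: bool, timed_out: bool) -> str:
    """Session affinity: the delegated agent keeps the conversation
    unless it hands back control, the topic changes, or it times out."""
    current = state.get("current_agent")
    if current and not topic_changed and not timed_out and not state.get("handed_back"):
        return current                      # affinity: stay with the owning agent
    return route_with_supervisor(state)     # otherwise re-classify the turn


def route_with_supervisor(state: dict) -> str:
    # Placeholder for the supervisor's LLM routing call.
    return "supervisor"
```

The key property: re-classification is the exception, not the default, which is what preserves the thread across a multi-turn search-then-book flow.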
Inter-Agent Communication
We tried two patterns before landing on our current approach:
Direct function calls (v1): Agents called each other's functions directly. Fast, but created tight coupling. Changing the search agent's interface broke the booking agent's "search then book" flow.
Message bus (v2, current): Agents communicate through a shared context object. Each agent reads from and writes to a structured conversation state. The supervisor manages the state transitions. This decoupled the agents — we could swap the search agent's implementation without touching anything else.
The context object looks roughly like this:
```
{
  session_id: "...",
  current_agent: "search",
  conversation_history: [...],
  agent_state: {
    search: { last_query, results, filters },
    booking: { selected_venue, time_slot, status }
  },
  routing_history: [
    { from: "supervisor", to: "search", reason: "user asked for recommendations" }
  ]
}
```
Failure Modes We Designed For
- Routing loops. The supervisor routes to search, search can't handle it, hands back, supervisor routes to search again. We added a loop detector: if the same agent is routed to 3 times in a row without progress, escalate to CS with human handoff.
- Agent timeout. If an agent takes longer than 30 seconds (usually due to slow tool calls), the supervisor takes back control and responds with a graceful fallback message.
- Confidence threshold. The supervisor outputs a confidence score with each routing decision. Below 0.6, it asks a clarifying question instead of routing blindly. This eliminated most misrouted conversations.
- Graceful degradation. If a specialized agent's tools are down (e.g., booking API outage), it returns a structured error. The supervisor catches this and either tries an alternative path or explains the limitation to the user.
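The loop detector is the simplest of these to show. A sketch over the `routing_history` entries from the context object, omitting the "without progress" check for brevity (our real detector also compares conversation state between routes):

```python
MAX_SAME_AGENT_ROUTES = 3  # 3 routes to the same agent in a row → human handoff


def detect_routing_loop(routing_history: list[dict]) -> bool:
    """Return True when the last N routing decisions all targeted the
    same agent, which the supervisor treats as a stuck loop."""
    if len(routing_history) < MAX_SAME_AGENT_ROUTES:
        return False
    recent = routing_history[-MAX_SAME_AGENT_ROUTES:]
    return len({entry["to"] for entry in recent}) == 1
```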
Observability: What We Track
Multi-agent systems are notoriously hard to debug. We built observability into the architecture from day one:
- Routing trace. Every conversation logs which agents handled which turns, with the supervisor's reasoning. This is invaluable for debugging "why did the bot say that?"
- Per-agent cost tracking. Each agent's token usage is tracked separately. We discovered our FAQ agent was consuming 40% of total tokens because its prompt was too long — trimming it saved meaningful cost.
- Latency breakdown. End-to-end latency decomposed into: routing decision time, agent thinking time, tool call time. The routing step adds ~200ms overhead, which we accepted as a worthwhile tradeoff.
- Quality metrics per agent. Each agent has its own evaluation suite (via promptfoo). We can detect regressions in one agent without noise from others.
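Per-agent cost tracking needs nothing fancier than an accumulator keyed by agent name. A minimal sketch (the class and method names are ours for illustration) — this is how a finding like "the FAQ agent consumes 40% of tokens" falls out:

```python
from collections import defaultdict


class AgentMetrics:
    """Minimal per-agent accounting for tokens and latency."""

    def __init__(self) -> None:
        self.tokens: defaultdict[str, int] = defaultdict(int)
        self.latency_ms: defaultdict[str, list[float]] = defaultdict(list)

    def record(self, agent: str, tokens: int, latency_ms: float) -> None:
        """Record one agent turn's token usage and latency."""
        self.tokens[agent] += tokens
        self.latency_ms[agent].append(latency_ms)

    def token_share(self, agent: str) -> float:
        """Fraction of total tokens consumed by this agent."""
        total = sum(self.tokens.values())
        return self.tokens[agent] / total if total else 0.0
```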
Key Tradeoffs
| Decision | Tradeoff |
|----------|----------|
| Hierarchical vs flat routing | +Multi-turn coherence, -Added latency (~200ms) |
| Session affinity | +Context preservation, -Risk of stuck sessions |
| Message bus vs direct calls | +Loose coupling, -Debugging indirection |
| Per-agent models | +Optimized behavior, -More model configs to manage |
| Confidence threshold routing | +Fewer misroutes, -More clarifying questions |
What We'd Do Differently
- Start with fewer agents. We launched with 6 and merged two of them within a month. Start with 3, split when evaluation data tells you to.
- Build the evaluation suite first. We built agents, then tried to evaluate them. It should be the other way around — define what "good" looks like for each agent before writing the prompts.
- Structured output from day one. Early agents returned free-text that the supervisor had to parse. Moving to structured JSON output for all inter-agent communication eliminated an entire class of bugs.
Conclusion
A multi-agent architecture isn't inherently better than a single agent. It's better when you need independent optimization of different capabilities, clear evaluation boundaries, and team-level ownership of different parts of the AI system. The overhead — routing latency, added complexity, more configs — is real. But for a production platform handling diverse user intents across multiple channels, the modularity has paid for itself many times over.
