Most AI agent demos show a single agent doing impressive things. One agent writing code. One agent doing research. One agent handling customer support.
That's the easy part.
The hard part is what happens when multiple agents need to work together. When one agent's output becomes another agent's input. When Agent A finishes a task and Agent B needs to pick it up without dropping context, changing formats, or inventing data that doesn't exist.
We've been running a 5-agent system for 45 days. Here's what we've learned about multi-agent orchestration that nobody talks about in the demos.
The Setup
Our system runs on OpenClaw, an open-source agent orchestration framework. Five agents, each with a defined role:
- Forge handles code: landing pages, deployments, bug fixes
- Ink handles content: articles, social posts, newsletter drafts
- Radar handles research: competitor monitoring, market analysis, daily intel reports
- Gate handles QA: checks every output for quality, banned words, fabricated data
- OPS handles coordination: task routing, schedule management, status updates
Each agent is configured through markdown files. SOUL.md defines personality, constraints, and communication style. BRAIN.md defines weekly priorities and task queues. No custom code required to set up a new agent.
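As a hedged illustration, here is roughly what a content agent's SOUL.md could look like. The section names and rules below are our guesses at a plausible layout, not the framework's actual schema:

```markdown
# SOUL.md — Ink (content agent)

## Personality
Plainspoken, concrete, allergic to hype.

## Constraints
- Never use em dashes.
- Never invent statistics; cite a source or omit the number.
- Keep paragraphs under four sentences.

## Communication style
- Articles: `##` headers, short intro, no "Conclusion" heading.
- Social posts: one idea per post, no hashtags.
```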
The architecture is simple: agents share a workspace directory and communicate through files and a Discord-based message bus. OPS reads the task queue and routes work. Gate reviews outputs before they ship.
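A minimal sketch of what OPS's routing loop could look like, assuming tasks arrive as JSON files in a shared queue directory. The directory layout, routing table, and task fields here are illustrative, not OpenClaw's actual conventions:

```python
import json
from pathlib import Path

# Illustrative routing table: task type -> owning agent (not OpenClaw's real layout)
ROUTES = {"code": "forge", "content": "ink", "research": "radar"}

def route_tasks(workspace: Path) -> list[str]:
    """Move each queued task file into the inbox of the agent that owns it."""
    routed = []
    for task_file in sorted((workspace / "queue").glob("*.json")):
        task = json.loads(task_file.read_text())
        agent = ROUTES.get(task["type"])
        if agent is None:
            continue  # leave unknown task types in the queue for a human
        inbox = workspace / agent / "inbox"
        inbox.mkdir(parents=True, exist_ok=True)
        task_file.rename(inbox / task_file.name)
        routed.append(f"{task_file.name} -> {agent}")
    return routed
```

The important property is that routing is a file move, so a task lives in exactly one place at a time.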
Simple on paper. Messy in practice.
What Actually Works
1. Specialized agents outperform generalists by a wide margin
Our content agent (Ink) produces noticeably better writing than when we used a single all-purpose agent. The reason isn't mysterious: Ink's SOUL.md contains detailed writing rules, banned words, platform-specific formatting guides, and style references. A generalist agent can't hold all of that context while also thinking about code architecture and market research.
The lesson: Agent specialization isn't just organizational. It's a context management strategy. Each agent gets a focused system prompt that fits within the model's effective context window.
2. Automated QA catches real problems
Gate (our QA agent) has caught fabricated data 12 times in the past month. The most memorable: a landing page with "2,847 newsletters repurposed" displayed prominently. The product hadn't launched yet. That number was invented.
Without Gate, that page would have gone live. With Gate, it got flagged at 5am and fixed before anyone saw it.
The lesson: AI agents fabricate data even when explicitly told not to. A separate QA agent with a different set of instructions provides a second pair of eyes that actually works.
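A second pass doesn't have to be sophisticated to be useful. Here is a sketch of the kind of rule a QA agent can apply mechanically: flag banned words, and flag any statistic in the output that never appeared in the source brief. The heuristic and word list are ours, not Gate's actual implementation:

```python
import re

# Illustrative banned-word list, not Gate's real one
BANNED = {"synergy", "game-changer", "revolutionize"}

def qa_flags(output: str, brief: str) -> list[str]:
    """Return problems found: banned words, plus numbers absent from the brief."""
    flags = []
    lowered = output.lower()
    for word in sorted(BANNED):
        if word in lowered:
            flags.append(f"banned word: {word}")
    # Any figure in the output that the brief never mentioned is suspect.
    brief_numbers = set(re.findall(r"\d[\d,]*", brief))
    for number in re.findall(r"\d[\d,]*", output):
        if number not in brief_numbers:
            flags.append(f"unsourced number: {number}")
    return flags
```

A check like this would have flagged "2,847 newsletters repurposed" on sight, because no source document contained that figure.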
3. Research at scale is the clearest ROI
Radar monitors 8 competitor products and produces a daily intel report by 3am. Changes are flagged, no-change items are marked green. What used to take a full day of manual checking is now a 10-minute morning read.
The reports aren't perfect. Maybe 80% accurate. But the time savings are enormous, and the 20% that needs correction is easy to spot.
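The core of a report like that is just a diff between snapshots. A minimal sketch, assuming each competitor page is captured daily and compared by hash (the storage format is our assumption, not Radar's):

```python
import hashlib

def intel_report(today: dict[str, str], yesterday: dict[str, str]) -> dict[str, str]:
    """Compare page snapshots by hash; 'CHANGED' needs a human look, 'green' does not."""
    report = {}
    for name, page in today.items():
        new = hashlib.sha256(page.encode()).hexdigest()
        old_page = yesterday.get(name)
        old = hashlib.sha256(old_page.encode()).hexdigest() if old_page is not None else None
        report[name] = "green" if new == old else "CHANGED"
    return report
```

Hashing whole pages over-flags (a changed copyright year reads as a change), which is one reason the reports need that 20% human correction.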
What Breaks
1. Handoff failures are the #1 problem
When Ink finishes an article and Gate needs to review it, the handoff should be simple: Ink saves a file, Gate reads it.
In practice:
- Ink sometimes saves to the wrong directory
- Gate sometimes caches old versions and reviews stale content
- Format expectations don't always match (one agent uses `---` as section dividers, another expects `##` headers)
These are the multi-agent equivalent of "the cutting department uses inches and the sewing department uses centimeters." Each agent works correctly in isolation. The failures happen at boundaries.
Frequency: We hit a handoff-related bug roughly every 3 days. Some are minor (wrong filename), some are serious (overwritten content that can't be recovered).
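Most of these failures are catchable with a tiny handoff contract: the producer writes a manifest next to the file, and the consumer refuses to read anything that doesn't match it. A sketch of that idea as our own convention, not an OpenClaw feature:

```python
import hashlib
import json
from pathlib import Path

def _manifest_path(path: Path) -> Path:
    return path.parent / (path.name + ".manifest")

def publish(path: Path, content: str) -> None:
    """Producer side: write the file plus a manifest carrying its checksum."""
    path.write_text(content)
    digest = hashlib.sha256(content.encode()).hexdigest()
    _manifest_path(path).write_text(json.dumps({"file": path.name, "sha256": digest}))

def consume(path: Path) -> str:
    """Consumer side: refuse any file whose checksum doesn't match its manifest."""
    manifest = json.loads(_manifest_path(path).read_text())
    content = path.read_text()
    if hashlib.sha256(content.encode()).hexdigest() != manifest["sha256"]:
        raise ValueError(f"stale or modified handoff: {path.name}")
    return content
```

A missing manifest turns "wrong directory" into a loud error, and a checksum mismatch turns "stale cached version" into one too.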
2. Agents interpret rules creatively
We have a rule: "never use em dashes." Clear, simple, unambiguous. One agent followed it perfectly. Another replaced em dashes with en dashes. A third used the Unicode character for "horizontal bar" instead. All technically compliant. None matching the intent.
This pattern repeats across every rule we've set. Agents follow the letter of instructions while sometimes missing the spirit. The more complex the rule, the more creative the interpretation.
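Character-level rules are easier to enforce mechanically than to phrase unambiguously. A lint pass like this would have caught all three workarounds; the set of dash lookalikes is our own and can be extended:

```python
# Dash-like characters agents substituted for em dashes: em dash, en dash,
# horizontal bar, figure dash, minus sign (illustrative set, extend as needed).
DASH_VARIANTS = {"\u2014", "\u2013", "\u2015", "\u2012", "\u2212"}

def dash_violations(text: str) -> list[tuple[int, str]]:
    """Return (position, codepoint) for every dash-like character in the text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in DASH_VARIANTS]
```

The general pattern: whenever a rule can be a check instead of an instruction, make it a check.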
3. Coordination overhead is real
OPS (our coordinator) spends tokens routing tasks, checking status, and sending updates. That coordination isn't free. On busy days, OPS generates more messages than any productive agent.
We considered removing OPS and having agents self-coordinate, and tried it for a week. It was worse. Without a coordinator, agents would start the same task simultaneously, or tasks would sit unassigned for hours.
The uncomfortable truth: Multi-agent systems need management, just like human teams do. The management cost doesn't disappear. It shifts from human time to compute time (and back to human time when things break).
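The worst self-coordination failure, two agents starting the same task, has a cheap mitigation even in a file-based system: an atomic claim file. This is our own pattern, not an OpenClaw feature:

```python
import os
from pathlib import Path

def claim_task(locks_dir: Path, task_id: str, agent: str) -> bool:
    """Atomically claim a task; returns False if another agent already holds it."""
    locks_dir.mkdir(parents=True, exist_ok=True)
    lock = locks_dir / f"{task_id}.lock"
    try:
        # O_CREAT | O_EXCL fails if the lock already exists, so only one claim wins.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(agent)  # record who holds the claim, for debugging
    return True
```

A lock file prevents duplicate work, but it doesn't assign work, which is why dropping the coordinator still left tasks sitting idle.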
The Numbers
After 45 days of running this system:
| Metric | Value |
|---|---|
| Products deployed | 7 |
| Revenue | $0 |
| Paid users | 0 |
| Monthly API cost | ~$200 |
| Articles produced | 50+ |
| Handoff failures caught | ~15 |
| Fabricated data incidents caught by QA | 12 |
| Human review time per day | 2-3 hours |
These numbers are not impressive from a business standpoint. Zero revenue after 45 days is not a success story. But from a systems standpoint, the throughput is real. A single non-technical person is operating a content pipeline, research system, and product development workflow that would normally require a small team.
Whether that throughput converts to revenue is the question we're still answering.
What We'd Do Differently
Start with 2 agents, not 5. We over-architected from day one. A content agent and a QA agent would have covered 80% of our needs. We added complexity before we understood the basics.
Define handoff protocols before adding agents. Every new agent is a new integration point. We should have documented file formats, directory conventions, and naming rules before spinning up agent #3.
Budget for human review time. "AI agents work while you sleep" is technically true. But you still need 2-3 hours daily to review what they produced while you slept. That's not a complaint. It's a realistic expectation that most agent demos don't mention.
What's Next
We're 45 days in with 7 products, 0 revenue, and a system that produces more content and code than we could manually. The experiment continues.
The question isn't whether multi-agent systems work. They do, within limits. The question is whether the work they produce can generate enough value to justify the cost, complexity, and ongoing human oversight.
We don't know the answer yet. But we're publishing everything we learn along the way.
Rootnest builds AI-powered tools for content creators. We run our own multi-agent system and document what works (and what doesn't). Follow our journey at rootnest.ai.
Built by Rootnest
We build digital employees, not just tools
内容搭子Pro and 外贸搭子 (roughly, "Content Buddy Pro" and "Trade Buddy") — AI agents that work for you, 24/7.