Most AI agent demos show a single agent doing impressive things. One agent writing code. One agent doing research. One agent handling customer support.
That's the easy part.
The hard part is what happens when multiple agents need to work together. When one agent's output becomes another agent's input. When Agent A finishes a task and Agent B needs to pick it up without dropping context, changing formats, or inventing data that doesn't exist.
We've been running a 5-agent system for 45 days. Here's what we've learned about multi-agent orchestration that nobody talks about in the demos.
The Setup
Our system runs on OpenClaw, an open-source agent orchestration framework. Five agents, each with a defined role:
- Forge handles code: landing pages, deployments, bug fixes
- Ink handles content: articles, social posts, newsletter drafts
- Radar handles research: competitor monitoring, market analysis, daily intel reports
- Gate handles QA: checks every output for quality, banned words, fabricated data
- OPS handles coordination: task routing, schedule management, status updates
Each agent is configured through markdown files. SOUL.md defines personality, constraints, and communication style. BRAIN.md defines weekly priorities and task queues. No custom code required to set up a new agent.
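As a hedged illustration, here is roughly what a content agent's SOUL.md could look like. The section names and rules below are our guesses at a plausible layout, not the framework's actual schema:

```markdown
# SOUL.md — Ink (content agent)

## Personality
Plainspoken, concrete, allergic to hype.

## Constraints
- Never use em dashes.
- Never invent statistics; cite a source or omit the number.
- Keep paragraphs under four sentences.

## Communication style
- Articles: `##` headers, short intro, no "Conclusion" heading.
- Social posts: one idea per post, no hashtags.
```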
The architecture is simple: agents share a workspace directory and communicate through files and a Discord-based message bus. OPS reads the task queue and routes work. Gate reviews outputs before they ship.
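A minimal sketch of what OPS's routing loop could look like, assuming tasks arrive as JSON files in a shared queue directory. The directory layout, routing table, and task fields here are illustrative, not OpenClaw's actual conventions:

```python
import json
from pathlib import Path

# Illustrative routing table: task type -> owning agent (not OpenClaw's real layout)
ROUTES = {"code": "forge", "content": "ink", "research": "radar"}

def route_tasks(workspace: Path) -> list[str]:
    """Move each queued task file into the inbox of the agent that owns it."""
    routed = []
    for task_file in sorted((workspace / "queue").glob("*.json")):
        task = json.loads(task_file.read_text())
        agent = ROUTES.get(task["type"])
        if agent is None:
            continue  # leave unknown task types in the queue for a human
        inbox = workspace / agent / "inbox"
        inbox.mkdir(parents=True, exist_ok=True)
        task_file.rename(inbox / task_file.name)
        routed.append(f"{task_file.name} -> {agent}")
    return routed
```

The important property is that routing is a file move, so a task lives in exactly one place at a time.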
Simple on paper. Messy in practice.
What Actually Works
1. Specialized agents outperform generalists by a wide margin
Our content agent (Ink) produces noticeably better writing than when we used a single all-purpose agent. The reason isn't mysterious: Ink's SOUL.md contains detailed writing rules, banned words, platform-specific formatting guides, and style references. A generalist agent can't hold all of that context while also thinking about code architecture and market research.
The lesson: Agent specialization isn't just organizational. It's a context management strategy. Each agent gets a focused system prompt that fits within the model's effective context window.
2. Automated QA catches real problems
Gate (our QA agent) has caught fabricated data 12 times in the past month. The most memorable: a landing page with "2,847 newsletters repurposed" displayed prominently. The product hadn't launched yet. That number was invented.
Without Gate, that page would have gone live. With Gate, it got flagged at 5am and fixed before anyone saw it.
The lesson: AI agents fabricate data even when explicitly told not to. A separate QA agent with a different set of instructions provides a second pair of eyes that actually works.
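A second pass doesn't have to be sophisticated to be useful. Here is a sketch of the kind of rule a QA agent can apply mechanically: flag banned words, and flag any statistic in the output that never appeared in the source brief. The heuristic and word list are ours, not Gate's actual implementation:

```python
import re

# Illustrative banned-word list, not Gate's real one
BANNED = {"synergy", "game-changer", "revolutionize"}

def qa_flags(output: str, brief: str) -> list[str]:
    """Return problems found: banned words, plus numbers absent from the brief."""
    flags = []
    lowered = output.lower()
    for word in sorted(BANNED):
        if word in lowered:
            flags.append(f"banned word: {word}")
    # Any figure in the output that the brief never mentioned is suspect.
    brief_numbers = set(re.findall(r"\d[\d,]*", brief))
    for number in re.findall(r"\d[\d,]*", output):
        if number not in brief_numbers:
            flags.append(f"unsourced number: {number}")
    return flags
```

A check like this would have flagged "2,847 newsletters repurposed" on sight, because no source document contained that figure.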
3. Research at scale is the clearest ROI
Radar monitors 8 competitor products and produces a daily intel report by 3am. Changes are flagged, no-change items are marked green. What used to take a full day of manual checking is now a 10-minute morning read.
The reports aren't perfect. Maybe 80% accurate. But the time savings are enormous, and the 20% that needs correction is easy to spot.
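The core of a report like that is just a diff between snapshots. A minimal sketch, assuming each competitor page is captured daily and compared by hash (the storage format is our assumption, not Radar's):

```python
import hashlib

def intel_report(today: dict[str, str], yesterday: dict[str, str]) -> dict[str, str]:
    """Compare page snapshots by hash; 'CHANGED' needs a human look, 'green' does not."""
    report = {}
    for name, page in today.items():
        new = hashlib.sha256(page.encode()).hexdigest()
        old_page = yesterday.get(name)
        old = hashlib.sha256(old_page.encode()).hexdigest() if old_page is not None else None
        report[name] = "green" if new == old else "CHANGED"
    return report
```

Hashing whole pages over-flags (a changed copyright year reads as a change), which is one reason the reports need that 20% human correction.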
What Breaks
1. Handoff failures are the #1 problem
When Ink finishes an article and Gate needs to review it, the handoff should be simple: Ink saves a file, Gate reads it.
In practice:
- Ink sometimes saves to the wrong directory
- Gate sometimes caches old versions and reviews stale content
- Format expectations don't always match (one agent uses `---` as section dividers, another expects `##` headers)
These are the multi-agent equivalent of "the cutting department uses inches and the sewing department uses centimeters." Each agent works correctly in isolation. The failures happen at boundaries.
Frequency: We hit a handoff-related bug roughly every 3 days. Some are minor (wrong filename), some are serious (overwritten content that can't be recovered).
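Most of these failures are catchable with a tiny handoff contract: the producer writes a manifest next to the file, and the consumer refuses to read anything that doesn't match it. A sketch of that idea as our own convention, not an OpenClaw feature:

```python
import hashlib
import json
from pathlib import Path

def _manifest_path(path: Path) -> Path:
    return path.parent / (path.name + ".manifest")

def publish(path: Path, content: str) -> None:
    """Producer side: write the file plus a manifest carrying its checksum."""
    path.write_text(content)
    digest = hashlib.sha256(content.encode()).hexdigest()
    _manifest_path(path).write_text(json.dumps({"file": path.name, "sha256": digest}))

def consume(path: Path) -> str:
    """Consumer side: refuse any file whose checksum doesn't match its manifest."""
    manifest = json.loads(_manifest_path(path).read_text())
    content = path.read_text()
    if hashlib.sha256(content.encode()).hexdigest() != manifest["sha256"]:
        raise ValueError(f"stale or modified handoff: {path.name}")
    return content
```

A missing manifest turns "wrong directory" into a loud error, and a checksum mismatch turns "stale cached version" into one too.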
2. Agents interpret rules creatively
We have a rule: "never use em dashes." Clear, simple, unambiguous. One agent followed it perfectly. Another replaced em dashes with en dashes. A third used the Unicode character for "horizontal bar" instead. All technically compliant. None matching the intent.
This pattern repeats across every rule we've set. Agents follow the letter of instructions while sometimes missing the spirit. The more complex the rule, the more creative the interpretation.
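Character-level rules are easier to enforce mechanically than to phrase unambiguously. A lint pass like this would have caught all three workarounds; the set of dash lookalikes is our own and can be extended:

```python
# Dash-like characters agents substituted for em dashes: em dash, en dash,
# horizontal bar, figure dash, minus sign (illustrative set, extend as needed).
DASH_VARIANTS = {"\u2014", "\u2013", "\u2015", "\u2012", "\u2212"}

def dash_violations(text: str) -> list[tuple[int, str]]:
    """Return (position, codepoint) for every dash-like character in the text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in DASH_VARIANTS]
```

The general pattern: whenever a rule can be a check instead of an instruction, make it a check.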
3. Coordination overhead is real
OPS (our coordinator) spends tokens routing tasks, checking status, and sending updates. That coordination isn't free. On busy days, OPS generates more messages than any productive agent.
We considered removing OPS and having agents self-coordinate, and tried it for a week. It was worse. Without a coordinator, agents would start the same task simultaneously, or tasks would sit unassigned for hours.
The uncomfortable truth: Multi-agent systems need management, just like human teams do. The management cost doesn't disappear. It shifts from human time to compute time (and back to human time when things break).
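The worst self-coordination failure, two agents starting the same task, has a cheap mitigation even in a file-based system: an atomic claim file. This is our own pattern, not an OpenClaw feature:

```python
import os
from pathlib import Path

def claim_task(locks_dir: Path, task_id: str, agent: str) -> bool:
    """Atomically claim a task; returns False if another agent already holds it."""
    locks_dir.mkdir(parents=True, exist_ok=True)
    lock = locks_dir / f"{task_id}.lock"
    try:
        # O_CREAT | O_EXCL fails if the lock already exists, so only one claim wins.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(agent)  # record who holds the claim, for debugging
    return True
```

A lock file prevents duplicate work, but it doesn't assign work, which is why dropping the coordinator still left tasks sitting idle.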
The Numbers
After 45 days of running this system:
| Metric | Value |
|---|---|
| Products deployed | 7 |
| Revenue | $0 |
| Paid users | 0 |
| Monthly API cost | ~$200 |
| Articles produced | 50+ |
| Handoff failures caught | ~15 |
| Fabricated data incidents caught by QA | 12 |
| Human review time per day | 2-3 hours |
These numbers are not impressive from a business standpoint. Zero revenue after 45 days is not a success story. But from a systems standpoint, the throughput is real. A single non-technical person is operating a content pipeline, research system, and product development workflow that would normally require a small team.
Whether that throughput converts to revenue is the question we're still answering.
What We'd Do Differently
Start with 2 agents, not 5. We over-architected from day one. A content agent and a QA agent would have covered 80% of our needs. We added complexity before we understood the basics.
Define handoff protocols before adding agents. Every new agent is a new integration point. We should have documented file formats, directory conventions, and naming rules before spinning up agent #3.
Budget for human review time. "AI agents work while you sleep" is technically true. But you still need 2-3 hours daily to review what they produced while you slept. That's not a complaint. It's a realistic expectation that most agent demos don't mention.
What's Next
We're 45 days in with 7 products, 0 revenue, and a system that produces more content and code than we could manually. The experiment continues.
The question isn't whether multi-agent systems work. They do, within limits. The question is whether the work they produce can generate enough value to justify the cost, complexity, and ongoing human oversight.
We don't know the answer yet. But we're publishing everything we learn along the way.
Rootnest builds AI-powered tools for content creators. We run our own multi-agent system and document what works (and what doesn't). Follow our journey at rootnest.ai.
Built by Rootnest
We build digital employees, not just tools
内容搭子Pro and 外贸搭子 (roughly, "Content Buddy Pro" and "Trade Buddy") — AI agents that work for you, 24/7.