How We Built a Multi-Agent AI Platform in 5 Weeks

D2C brands are flying blind. They're spending $50k–$500k per month on Meta and Google ads, and their reporting setup is a spreadsheet someone built in 2022, a GA4 dashboard nobody trusts, and a weekly Slack message from their media buyer. The signal is buried under latency, attribution noise, and creative fatigue that compounds silently until the ROAS tanks.

The client needed a system — not a dashboard, not another BI tool — that could actively monitor, diagnose, and surface insights across performance, creative, attribution, benchmarking, and alerting, all from a single coherent view. We shipped the first production version in 5–6 weeks.

Here's how we built it.

The Architecture: Five Agents, One Orchestrator

The system is organized around five specialized agents, each owning a distinct domain:

─Performance Agent — monitors spend, ROAS, CPM, CTR, and frequency across campaigns and ad sets. Detects budget exhaustion patterns and trend inflections.
─Creative Agent — tracks creative fatigue signals. Knows when a video has been served too many times to the same audience segment and surfaces replacement recommendations.
─Attribution Agent — reconciles platform-reported conversions against first-party data. Handles the post-iOS14 mess of modeled conversions and view-through attribution.
─Benchmark Agent — compares current performance against historical baselines and industry benchmarks. Puts a number to "is this actually bad or just Tuesday?"
─Alerting Agent — the final layer. Consumes structured outputs from the other four agents and decides what merits a notification, at what urgency, and to whom.

These agents don't run in parallel free-for-all. They're orchestrated using LangGraph, which gives us explicit control over the execution graph — what runs when, what depends on what, and where human-in-the-loop steps sit.

How Agents Coordinate with LangGraph

LangGraph models the system as a directed graph of nodes and edges. Each agent is a node. State flows between nodes as a typed dict that accumulates context across the pipeline.

Here's a simplified version of the Performance Agent node:

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator
 
class AgentState(TypedDict):
    campaign_data: dict
    performance_signals: Annotated[list, operator.add]
    creative_signals: Annotated[list, operator.add]
    attribution_data: dict
    benchmark_comparison: dict
    alerts: Annotated[list, operator.add]
    run_metadata: dict
 
def performance_agent(state: AgentState) -> AgentState:
    data = state["campaign_data"]
 
    # Detect ROAS degradation over rolling 7-day window
    signals = analyze_performance_trends(
        data,
        lookback_days=7,
        threshold_roas_drop=0.15,
        threshold_frequency=3.5,
    )
 
    return {"performance_signals": signals}
 
# Wire into graph
builder = StateGraph(AgentState)
builder.add_node("performance", performance_agent)
builder.add_node("creative", creative_agent)
builder.add_node("attribution", attribution_agent)
builder.add_node("benchmark", benchmark_agent)
builder.add_node("alerting", alerting_agent)
 
# Performance and creative run first, then attribution, then benchmark, then alerts
builder.add_edge("performance", "attribution")
builder.add_edge("creative", "attribution")
builder.add_edge("attribution", "benchmark")
builder.add_edge("benchmark", "alerting")
builder.add_edge("alerting", END)

The Annotated[list, operator.add] pattern is key — it lets multiple upstream agents append to the same list field without clobbering each other. The alerting agent sees the full picture.

Evals and Guardrails from Day One

We made a deliberate call early: no agent ships without an eval harness. This sounds obvious, but in practice most teams skip it until something breaks in prod. We've been burned before.

For each agent, we built:

─Unit-level evals — fixed inputs with known expected outputs. "Given this campaign data with a 22% ROAS drop, does the Performance Agent surface a degradation signal?" Pass/fail, run on every commit.
─Golden set comparisons — a set of 50 real historical scenarios with human-labeled ground truth. Used for regression testing when we change prompts or model versions.
─Hallucination guards — the Attribution Agent in particular has a structured output schema with required source fields. If it can't cite the underlying data, it returns a low-confidence signal rather than fabricating certainty.

The single most important engineering decision we made was treating agent outputs as unreliable by default. Every signal that flows between agents carries a confidence score and a source reference. The Alerting Agent won't fire a high-urgency notification unless confidence exceeds a threshold we tuned manually against the golden set.

Model Selection and Cost Controls

We didn't use the same model everywhere. The Performance and Benchmark agents do mostly structured analysis — pattern matching on numeric data with a defined schema. For those, gpt-4o-mini was accurate enough and significantly cheaper.

The Attribution Agent handles messier reasoning — reconciling conflicting signals between platform data, pixel data, and modeled conversions. That one uses gpt-4o.

The Alerting Agent is the most consequential: it decides whether to wake someone up at 2am. We kept it on gpt-4o and added an explicit "pause and reason" step before it commits to high-urgency alerts.

Total cost per daily analysis run at launch: under $0.40. We budgeted for $2 and hit $0.38.

Latency budget: We targeted end-to-end runs under 45 seconds for a typical account (10–30 active campaigns). The graph runs performance/creative in parallel before handing off to attribution. That single parallelization cut wall-clock time from ~70s to ~38s.

What We'd Do Differently

Start with the state schema. We refactored it twice because early agents made assumptions about what fields downstream agents would populate. Define the full TypedDict first, document every field, and don't let agents write to fields they don't own.

Separate analysis from communication. We initially had agents return human-readable summaries. That made downstream parsing fragile. We moved to structured outputs for all inter-agent communication and reserved prose generation for the final alerting layer only.

Run evals before the demo. The client demo was week 3. We had one agent generating slightly over-confident attribution signals. We caught it during an eval run at 11pm the night before. If we'd been running evals from week 1, we would have caught it in week 2.

Is Multi-Agent Worth It?

For single-domain problems, no. A well-prompted single agent with good tools is simpler, cheaper, and easier to debug.

Multi-agent architecture earns its complexity when:

─The problem naturally decomposes into domains with different data sources and reasoning styles
─You need specialization without coupling — the Creative Agent shouldn't care how attribution works
─You want to swap or upgrade one agent without touching the others
─Your failure modes need to be isolated — a broken Attribution Agent shouldn't take down Performance monitoring

The architecture matched the problem. That alignment is what made 5 weeks possible.

If you're building something in this space and want to talk through the tradeoffs, reach out.

The Architecture: Five Agents, One Orchestrator

How Agents Coordinate with LangGraph

Evals and Guardrails from Day One

Model Selection and Cost Controls

What We'd Do Differently

Is Multi-Agent Worth It?

Building something like this?

Keep reading.

AI Features That Actually Work: What to Build Into Your SaaS in 2025

Ship the product. This quarter.

Tell us about your project