15/15 benchmark calls passed · March 2026

Three-Flow Benchmark

We tested DeepPlan's three orchestration flows — Baseline, Auto Council, and Paired Debate — against five real-world architecture prompts drawn from the Healthcare, Logistics, Education, Manufacturing, and Fintech domains.

API: deepplan-website.pages.dev/api/mcp/council · Settings: Auto Persona Pick, Max 8 Experts, AI Search enabled

Overall Score

Quality comparison at a glance

Baseline (mode: ai_pure): 56/100
Auto Council (mode: current): 74/100
Paired Debate (mode: debate): 81/100

Visual Comparison

Radar chart — 5 quality dimensions

Axes: Speed · Structure · Checklist · Edge Cases · Depth
Baseline 56/100
Auto Council 74/100
Paired Debate 81/100

Category Breakdown

Scores by dimension

Speed (30% weight)

Measured by wall-clock response time (faster modes score higher)

Baseline 100/100 · Auto Council 49/100 · Paired Debate 36/100

Structure (25% weight)

Measured by section count and organization

Baseline 57/100 · Auto Council 96/100 · Paired Debate 100/100

Checklist (20% weight)

Measured by verification checklist item count

Baseline 0/100 · Auto Council 57/100 · Paired Debate 100/100

Edge Cases (15% weight)

Measured by risk, fallback, and race condition coverage

Baseline 27/100 · Auto Council 91/100 · Paired Debate 100/100

Depth (10% weight)

Measured by output length and implementation detail

Baseline 78/100 · Auto Council 100/100 · Paired Debate 100/100

Raw Performance

Wall-clock time, output, and expert count

Prompt             Time (s)                Output (chars)            Experts
                   Base  Council  Debate   Base    Council  Debate   Base  Council  Debate
P1 Healthcare      20    41       47       4,821   6,228    6,261    6     6        7
P2 Food Delivery   20    38       50       4,467   5,942    5,883    6     8        8
P3 University LMS  18    38       55       5,078   6,005    5,725    8     7        6
P4 Supply Chain    19    41       51       4,604   6,201    6,762    5     6        5
P5 P2P Lending     17    35       57       4,836   6,139    6,061    6     6        8
Average            18.8  38.6     52.0     4,761   6,103    6,098    6.2   6.6      6.8

Test Prompts

5 real-world architecture scenarios (P1–P5 in the Raw Performance table)

Methodology

How we scored

Scoring Weights

Speed 30%
Structure 25%
Checklist 20%
Edge Cases 15%
Depth 10%

Normalization

  • Speed: fastest mode = 100, scale by fastest / mode_time
  • Other metrics: best mode = 100, scale proportionally
  • Overall: weighted sum of 5 category scores
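Applied to the average wall-clock times and per-category scores reported above, these rules reproduce the headline numbers. A minimal sketch in Python (the helper names are ours, not part of DeepPlan; all weights and values are transcribed from this report):

```python
# Scoring weights and per-category scores as reported in this benchmark.
WEIGHTS = {"speed": 0.30, "structure": 0.25, "checklist": 0.20,
           "edge_cases": 0.15, "depth": 0.10}

SCORES = {
    "Baseline":      {"speed": 100, "structure": 57,  "checklist": 0,
                      "edge_cases": 27,  "depth": 78},
    "Auto Council":  {"speed": 49,  "structure": 96,  "checklist": 57,
                      "edge_cases": 91,  "depth": 100},
    "Paired Debate": {"speed": 36,  "structure": 100, "checklist": 100,
                      "edge_cases": 100, "depth": 100},
}

def speed_score(mode_time: float, fastest: float) -> int:
    """Fastest mode scores 100; others scale by fastest / mode_time."""
    return round(100 * fastest / mode_time)

def overall(cat_scores: dict) -> int:
    """Overall = weighted sum of the five category scores."""
    return round(sum(WEIGHTS[k] * v for k, v in cat_scores.items()))

# Speed scores from the average times (18.8s, 38.6s, 52s):
print(speed_score(38.6, 18.8))  # 49 (Auto Council)
print(speed_score(52.0, 18.8))  # 36 (Paired Debate)

# Overall scores match the headline numbers: 56, 74, 81.
for mode, scores in SCORES.items():
    print(mode, overall(scores))
```

Note that Baseline's weak Checklist (0) and Edge Cases (27) scores cost it more than its perfect Speed score earns, which is why it trails despite Speed carrying the largest weight.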

Test Configuration

Selection Mode
Auto Persona Pick
Max Experts
8
AI Search
Enabled (Vectorize cache)

Key Findings

What we learned

Context-aware Chair selection

Security Architect was chosen as Chair for billing and API-gateway prompts, Product Manager for healthcare, and DevOps Architect for IoT and infrastructure. The AI picker correctly matches expertise to domain.

Debate catches more risks

Paired Debate produces 75% more checklist items than Auto Council. The Pro/Con self-critique forces each expert to adversarially review their own proposals, surfacing hidden risks.

AI Search cache hit on all calls

All 15 benchmark calls hit the knowledge base cache. No web search fallback was needed. The self-improving KB harvests context from previous runs.

Expert count adapts to complexity

The Auto Picker selected 5–8 experts depending on prompt complexity. Simpler domains got 5 experts; complex multi-system prompts (IoT, fintech) got up to 8. The Max Experts cap of 8 was respected.

Try it yourself

Run your own architecture plans through Baseline, Auto Council, or Paired Debate.
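The endpoint's request schema is not documented in this report, so the payload below is an assumption — the field names (`prompt`, `mode`, `max_experts`, `auto_persona_pick`, `ai_search`) are hypothetical, and only the mode identifiers (`ai_pure`, `current`, `debate`) and settings come from the benchmark configuration above. A sketch of constructing one call:

```python
import json

API_URL = "https://deepplan-website.pages.dev/api/mcp/council"

def build_request(prompt: str, mode: str) -> str:
    """Build a JSON body for one call. Field names are hypothetical;
    mode identifiers are the ones used in this report:
    "ai_pure" (Baseline), "current" (Auto Council), "debate" (Paired Debate).
    """
    if mode not in {"ai_pure", "current", "debate"}:
        raise ValueError(f"unknown mode: {mode}")
    return json.dumps({
        "prompt": prompt,            # architecture scenario text
        "mode": mode,                # which orchestration flow to run
        "max_experts": 8,            # benchmark setting: Max 8 Experts
        "auto_persona_pick": True,   # benchmark setting: Auto Persona Pick
        "ai_search": True,           # benchmark setting: AI Search enabled
    })

# Placeholder prompt for illustration, not one of the benchmark's P1-P5.
body = build_request("Design a multi-tenant telehealth platform", "debate")
```

POSTing this body to `API_URL` would then run the chosen flow; consult the actual API documentation for the real field names before relying on this shape.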