Three-Flow Benchmark
We tested DeepPlan's three orchestration flows — Baseline, Auto Council, and Paired Debate — against five real-world architecture prompts drawn from the healthcare, logistics, education, manufacturing, and fintech domains.
API: deepplan-website.pages.dev/api/mcp/council · Settings: Auto Persona Pick, Max 8 Experts, AI Search enabled
Overall Score
Quality comparison at a glance
Visual Comparison
Radar chart — 5 quality dimensions
Category Breakdown
Scores by dimension
Speed (30% weight)
Measured by response time (lower is better)
Structure (25% weight)
Measured by section count and organization
Checklist (20% weight)
Measured by verification checklist item count
Edge Cases (15% weight)
Measured by risk, fallback, and race condition coverage
Depth (10% weight)
Measured by output length and implementation detail
Raw Performance
Wall-clock time, output, and expert count
| Prompt | Time (Base) | Time (Council) | Time (Debate) | Output chars (Base) | Output chars (Council) | Output chars (Debate) | Experts (Base) | Experts (Council) | Experts (Debate) |
|---|---|---|---|---|---|---|---|---|---|
| P1 Healthcare | 20s | 41s | 47s | 4,821 | 6,228 | 6,261 | 6 | 6 | 7 |
| P2 Food Delivery | 20s | 38s | 50s | 4,467 | 5,942 | 5,883 | 6 | 8 | 8 |
| P3 University LMS | 18s | 38s | 55s | 5,078 | 6,005 | 5,725 | 8 | 7 | 6 |
| P4 Supply Chain | 19s | 41s | 51s | 4,604 | 6,201 | 6,762 | 5 | 6 | 5 |
| P5 P2P Lending | 17s | 35s | 57s | 4,836 | 6,139 | 6,061 | 6 | 6 | 8 |
| Average | 18.8s | 38.6s | 52.0s | 4,761 | 6,103 | 6,138 | 6.2 | 6.6 | 6.8 |
Test Prompts
5 real-world architecture scenarios
Methodology
How we scored
Scoring Weights
Normalization
- Speed: fastest mode = 100; slower modes scaled by `fastest / mode_time`
- Other metrics: best mode = 100; others scaled proportionally
- Overall: weighted sum of the 5 category scores
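The normalization and weighting rules above can be sketched in Python. The weights and scaling rules come from this methodology section; the function names, category keys, and per-mode raw values are illustrative, not part of the benchmark harness:

```python
# Weights from the Category Breakdown section (sum to 1.0).
WEIGHTS = {"speed": 0.30, "structure": 0.25, "checklist": 0.20,
           "edge_cases": 0.15, "depth": 0.10}

def normalize(raw: dict[str, float], lower_is_better: bool = False) -> dict[str, float]:
    """Scale each mode's raw value to a 0-100 score.

    Speed (lower is better): fastest mode = 100, others = fastest / value * 100.
    Other metrics (higher is better): best mode = 100, others scaled proportionally.
    """
    if lower_is_better:
        best = min(raw.values())
        return {mode: best / value * 100 for mode, value in raw.items()}
    best = max(raw.values())
    return {mode: value / best * 100 for mode, value in raw.items()}

def overall(scores_by_category: dict[str, dict[str, float]]) -> dict[str, float]:
    """Weighted sum of the five normalized category scores, per mode."""
    modes = next(iter(scores_by_category.values())).keys()
    return {m: sum(WEIGHTS[c] * s[m] for c, s in scores_by_category.items())
            for m in modes}

# Example using the measured average times (s) from the Raw Performance table:
speed = normalize({"base": 18.8, "council": 38.6, "debate": 52.0},
                  lower_is_better=True)
# base = 100, council ~ 48.7, debate ~ 36.2
```

Because the weights sum to 1.0, a mode that wins every category scores exactly 100 overall.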
Test Configuration
Key Findings
What we learned
Context-aware Chair selection
The AI picker chose the Security Architect as Chair for the billing and API gateway prompts, the Product Manager for healthcare, and the DevOps Architect for IoT and infrastructure, correctly matching expertise to domain.
Debate catches more risks
Paired Debate produces 75% more checklist items than Auto Council. The Pro/Con self-critique forces each expert to adversarially review their own proposals, surfacing hidden risks.
AI Search cache hit on all calls
All 15 benchmark calls hit the knowledge base cache. No web search fallback was needed. The self-improving KB harvests context from previous runs.
Expert count adapts to complexity
The Auto Picker selected 5–8 experts depending on prompt complexity. Simpler domains got 5 experts, complex multi-system prompts (IoT, fintech) got up to 8. Max setting was respected.
Try it yourself
Run your own architecture plans through Baseline, Auto Council, or Paired Debate.