Three-Flow Benchmark
We tested DeepPlan's three orchestration flows — Baseline, Auto Council, and Paired Debate — against five real-world architecture prompts drawn from the healthcare, logistics, education, manufacturing, and fintech domains.
API: deepplan-website.pages.dev/api/mcp/council · Settings: Auto Persona Pick, Max 8 Experts, AI Search enabled
Overall Score
Quality comparison at a glance
Visual Comparison
Radar chart — 5 quality dimensions
Category Breakdown
Scores by dimension
Speed (30% weight)
Measured by response time (lower is better)
Structure (25% weight)
Measured by section count and organization
Checklist (20% weight)
Measured by verification checklist item count
Edge Cases (15% weight)
Measured by risk, fallback, and race condition coverage
Depth (10% weight)
Measured by output length and implementation detail
Raw Performance
Wall-clock time, output, and expert count
| Prompt | Time (Base) | Time (Council) | Time (Debate) | Output chars (Base) | Output chars (Council) | Output chars (Debate) | Experts (Base) | Experts (Council) | Experts (Debate) |
|---|---|---|---|---|---|---|---|---|---|
| P1 Healthcare | 20s | 41s | 47s | 4,821 | 6,228 | 6,261 | 6 | 6 | 7 |
| P2 Food Delivery | 20s | 38s | 50s | 4,467 | 5,942 | 5,883 | 6 | 8 | 8 |
| P3 University LMS | 18s | 38s | 55s | 5,078 | 6,005 | 5,725 | 8 | 7 | 6 |
| P4 Supply Chain | 19s | 41s | 51s | 4,604 | 6,201 | 6,762 | 5 | 6 | 5 |
| P5 P2P Lending | 17s | 35s | 57s | 4,836 | 6,139 | 6,061 | 6 | 6 | 8 |
| Average | 18.8s | 38.6s | 52.0s | 4,761 | 6,103 | 6,138 | 6.2 | 6.6 | 6.8 |
Test Prompts
5 real-world architecture scenarios
Methodology
How we scored
Scoring Weights
Normalization
- Speed: fastest mode = 100; slower modes scaled by `fastest / mode_time`
- Other metrics: best mode = 100; others scaled proportionally
- Overall: weighted sum of the 5 category scores
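The normalization and weighting rules above can be sketched in Python. The weights and scaling rules come from this methodology section; the function names, category keys, and per-mode raw values are illustrative, not part of the benchmark harness:

```python
# Weights from the Category Breakdown section (sum to 1.0).
WEIGHTS = {"speed": 0.30, "structure": 0.25, "checklist": 0.20,
           "edge_cases": 0.15, "depth": 0.10}

def normalize(raw: dict[str, float], lower_is_better: bool = False) -> dict[str, float]:
    """Scale each mode's raw value to a 0-100 score.

    Speed (lower is better): fastest mode = 100, others = fastest / value * 100.
    Other metrics (higher is better): best mode = 100, others scaled proportionally.
    """
    if lower_is_better:
        best = min(raw.values())
        return {mode: best / value * 100 for mode, value in raw.items()}
    best = max(raw.values())
    return {mode: value / best * 100 for mode, value in raw.items()}

def overall(scores_by_category: dict[str, dict[str, float]]) -> dict[str, float]:
    """Weighted sum of the five normalized category scores, per mode."""
    modes = next(iter(scores_by_category.values())).keys()
    return {m: sum(WEIGHTS[c] * s[m] for c, s in scores_by_category.items())
            for m in modes}

# Example using the measured average times (s) from the Raw Performance table:
speed = normalize({"base": 18.8, "council": 38.6, "debate": 52.0},
                  lower_is_better=True)
# base = 100, council ~ 48.7, debate ~ 36.2
```

Because the weights sum to 1.0, a mode that wins every category scores exactly 100 overall.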
Test Configuration
Key Findings
What we learned
Context-aware Chair selection
The AI picker chose the Security Architect as Chair for the billing and API gateway prompts, the Product Manager for healthcare, and the DevOps Architect for IoT and infrastructure, correctly matching expertise to domain.
Debate catches more risks
Paired Debate produces 75% more checklist items than Auto Council. The Pro/Con self-critique forces each expert to adversarially review their own proposals, surfacing hidden risks.
AI Search cache hit on all calls
All 15 benchmark calls hit the knowledge base cache. No web search fallback was needed. The self-improving KB harvests context from previous runs.
Expert count adapts to complexity
The Auto Picker selected 5–8 experts depending on prompt complexity. Simpler domains got 5 experts, complex multi-system prompts (IoT, fintech) got up to 8. Max setting was respected.
Try it yourself
Run your own architecture plans through Baseline, Auto Council, or Paired Debate.