Planner Acceptance Report
Release gate for the V2 planner architecture.
Each run appends an entry under Historical Runs. The recommendation in each entry determines whether the next release phase should proceed.
Release Phases
| Phase | Description | Gate |
|---|---|---|
| A | Shadow mode – metrics collection, no behavioral change | Report generated |
| B | Planner V2 production – legacy behind feature flag | 0 unexpected, >=95% parity, >=1 stable cycle |
| C | Delete legacy planner | >=2 stable B cycles |
Acceptance Criteria
Phase A (Shadow)
- Shadow report runner produces a complete report
- All investigation types represented (architecture, CI, git, overview, navigation, code change, chat)
- 200+ investigations recommended before Phase B
Phase B (Production Switch)
- 0 unexpected disagreements in latest report
- >=95% completion parity
- No regressions in benchmark suite (32/32 passing)
- No regressions in cursor-tester (46/46 passing)
- No regressions in validation (28/28 passing)
- All 85 unit tests passing
- Legacy planner behind `CURSOR_PLANNER_V2` flag
- Rollback tested
Phase C (Cleanup)
- >=2 stable B cycles with no rollbacks
- All legacy planner code removed
- `CURSOR_PLANNER_V2` flag removed
- Documentation updated
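The gates in the Release Phases table can be read as a simple predicate over a report summary. The following is a minimal sketch under assumptions: `ReportSummary`, its field names, and `gate_met` are hypothetical and not part of the planner codebase, and the test-suite criteria for Phase B are left out because only the table's gate column is modeled.

```python
from dataclasses import dataclass


@dataclass
class ReportSummary:
    """Hypothetical summary of the latest acceptance report (not a real type)."""
    report_complete: bool      # Phase A: shadow runner produced a complete report
    unexpected: int            # unexpected disagreements in the latest report
    completion_parity: float   # fraction in [0, 1]
    stable_cycles: int         # stable cycles observed so far
    stable_b_cycles: int       # stable Phase B cycles with no rollbacks


def gate_met(phase: str, s: ReportSummary) -> bool:
    """Check the gate listed for `phase` in the Release Phases table."""
    if phase == "A":
        return s.report_complete
    if phase == "B":
        return (s.unexpected == 0
                and s.completion_parity >= 0.95
                and s.stable_cycles >= 1)
    if phase == "C":
        return s.stable_b_cycles >= 2
    raise ValueError(f"unknown phase: {phase!r}")


# Values taken from the recorded runs: 5 unexpected, 63/63 completion parity.
# report_complete and the cycle counts are placeholders; the report does not state them.
latest = ReportSummary(report_complete=True, unexpected=5,
                       completion_parity=63 / 63,
                       stable_cycles=0, stable_b_cycles=0)
print(gate_met("A", latest))  # True
print(gate_met("B", latest))  # False
```

With the numbers recorded under Historical Runs, the Phase B gate fails on the unexpected-disagreement count alone, which is consistent with the "Investigate" recommendation.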
Historical Runs
Run 2026-06-28 05:21:51
- Commit: `4854fb1c`
- Investigations: 63 / 63
- Agreements (step-level): 67
- Disagreements: 40
  - Expected: 35
  - Unexpected: 5
- Agreement rate: 62.6%
- Sequences identical: 44/63 (69.8%)
- Completion parity: 63/63 (100.0%)
- Avg legacy iterations: 2.52
- Avg planner V2 iterations: 1.70
- Recommendation: Investigate – 5 unexpected
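The percentages in this run can be reproduced from the raw counts. The sketch below assumes the agreement rate is agreements divided by all step-level comparisons (agreements plus disagreements); that definition is inferred from the reported numbers, not stated by the report runner. The helper `pct` is hypothetical.

```python
def pct(numerator: int, denominator: int) -> str:
    """Format a ratio as a percentage with one decimal place."""
    return f"{100 * numerator / denominator:.1f}%"


agreements, disagreements = 67, 40
expected, unexpected = 35, 5

# Expected and unexpected disagreements should partition the total.
assert expected + unexpected == disagreements

print(pct(agreements, agreements + disagreements))  # 62.6%  (agreement rate)
print(pct(44, 63))                                  # 69.8%  (sequences identical)
print(pct(63, 63))                                  # 100.0% (completion parity)
```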
Findings
5 unexpected disagreements identified:
- “add a new field to session_state” (2 unexpected): The planner fixated on `find` for FileSearch while the legacy loop moved on to build/test tools. The planner does not detect when it should stop trying Acquire for an evidence class.
- “fix compile warnings in auth_service” (3 unexpected): The planner chose `cmake --build` (Acquire Build) while legacy ran a content search first. Both are valid strategies: the planner chose build evidence first, legacy chose content. Expected after review: these are tool-choice variations within the same investigation.
Assessment: Completion parity is 100% across all 63 queries. The unexpected disagreements are an issue of the tagging process, not a structural one. Both unexpected disagreement categories are strategy variations: the planner differs from legacy in which tool it tries first, not in whether it completes. Safe to track rather than block.
Run 2026-06-28 05:24:09
- Commit: `4854fb1c`
- Investigations: 63 / 63
- Agreements (step-level): 67
- Disagreements: 40
  - Expected: 35
  - Unexpected: 5
- Agreement rate: 62.6%
- Sequences identical: 44/63 (69.8%)
- Completion parity: 63/63 (100.0%)
- Avg legacy iterations: 2.52
- Avg planner V2 iterations: 1.70
- Recommendation: Investigate – 5 unexpected
Run 2026-06-28 05:31:16
- Commit: `4854fb1c`
- Investigations: 63 / 63
- Agreements (step-level): 67
- Disagreements: 40
  - Expected: 35
  - Unexpected: 5
- Agreement rate: 62.6%
- Sequences identical: 44/63 (69.8%)
- Completion parity: 63/63 (100.0%)
- Avg legacy iterations: 2.52
- Avg planner V2 iterations: 1.70
- Recommendation: Investigate – 5 unexpected
Run 2026-06-28 05:34:30
- Commit: `4854fb1c`
- Investigations: 63 / 63
- Agreements (step-level): 67
- Disagreements: 40
  - Expected: 35
  - Unexpected: 5
- Agreement rate: 62.6%
- Sequences identical: 44/63 (69.8%)
- Completion parity: 63/63 (100.0%)
- Avg legacy iterations: 2.52
- Avg planner V2 iterations: 1.70
- Recommendation: Investigate – 5 unexpected