Release Readiness Report
Date: 2026-06-26
Build: ./build/bin/cursor-agent
Cycle: Code Search Excellence – Phase 1 Audit
Checklist: docs/release/release_checklist.md
Decision
BLOCKED
The engine is operationally correct. It is not yet product-correct.
Gate 3 (Search Correctness) fails. Release is not authorized.
Gate 1 – Build: PASS
| Check | Result |
|---|---|
| Clean build | ✓ 0 errors |
| Self-test | ✓ 8/8 passed |
| Doctor | ✓ 15/15 passed, 0 failed |
| Benchmark suite | ✓ 14/14 scenarios |
Gate 2 – Execution: PASS
| Check | Result |
|---|---|
| 19/19 audit commands exit 0 | ✓ |
| Tool routing (9/9 direct commands) | ✓ |
--capabilities exits 0 |
✓ (fixed this cycle) |
| Checkpoint contamination | ✓ absent (fixed this cycle) |
| cmake / ctest routing | ✓ correct (fixed this cycle) |
This was the conclusion of the previous readiness report.
It was necessary but insufficient.
Gate 3 – Search Correctness: FAIL
Source: docs/telemetry/search_correctness_report.md
| Metric | Result | Target |
|---|---|---|
| Queries audited | 11 | 11 |
| Semantically correct (PASS) | 2 (18%) | ≥ 90% |
| Partially correct (PARTIAL) | 4 (36%) | – |
| Semantically wrong (FAIL) | 5 (45%) | 0 Critical |
| Confirmed bugs | 8 | 0 |
| Grep fallback rate | 27% | ≤ 20% |
Confirmed semantic failures
| ID | Query | Failure | Severity |
|---|---|---|---|
| B-01 | find struct ToolResult |
Regression fixture ranked above source header | Critical |
| B-02 | what owns CommandRouter |
0 tools executed; classified as General Chat | Critical |
| B-03 | what depends on ExecutionEngine |
0 tools executed; classified as General Chat | Critical |
| B-04 | where is configuration loaded |
Phrase normalized to non-existent symbol; grep matched .md not .cpp |
High |
| B-05 | what is the git diff |
Live git operation misclassified as codebase search | High |
| B-06 | who calls ReplayService |
Declaring header ranked above calling .cpp files |
High |
| B-07 | where is CommandRouter referenced |
Defining header ranked above main.cpp instantiation |
High |
| B-08 | git status |
git log executed instead of git status |
Medium |
Three failure modes
Classifier failures (4 bugs – B-02, B-03, B-05, B-08)
The intent classifier does not recognize ownership (what owns), dependency (what depends on), or natural-language git operations as investigation queries. Two queries returned zero tools executed and confidence 0.0.
Ranking failures (3 bugs – B-01, B-06, B-07)
The file ranker selects the declaring/defining file over the calling/using file for reference queries, and selects fixture files over source files for declaration queries.
Phrase normalization failure (1 bug – B-04)
Compound phrases are joined with underscores, producing symbols that do not exist in the codebase. Grep fallback then matches documentation files instead of source.
Gate 4 – UX: PASS (conditional)
| Check | Result |
|---|---|
| Progress timeline visible | ✓ for investigation queries |
| Evidence visible in JSON | ✓ files_examined populated when tools run |
| Insufficient evidence distinguishable | ✓ outcome: insufficient_evidence correct |
| Errors distinguishable | ✓ stderr separate from stdout |
Condition: UX is acceptable for queries that reach the investigation path. Queries classified as General Chat show no progress UI at all – this is acceptable only while classifier failures remain open.
Gate 5 – Telemetry: PASS
| Check | Result |
|---|---|
| Dashboard current | ✓ 483 sessions, 1408 events |
| Search correctness report | ✓ docs/telemetry/search_correctness_report.md |
| Calibration reviewed | ✓ confidence bands show 97.4% success rate in 0.6–0.8 band |
| Failure topology | ✓ docs/telemetry/failure_topology.md |
What Changed vs. Previous Report
The previous readiness report concluded PASS based on 19/19 audit commands returning exit code 0.
This report supersedes that conclusion.
The previous audit measured operational correctness:
- Does the binary execute?
- Do commands route to the right tool?
- Do exit codes match expectations?
This audit measured product correctness:
- Does the engine return the right file for a given query?
- Does it recognize the query intent?
- Does it rank results correctly?
A code-search agent that consistently returns wrong files is not release-ready even if every command exits with code 0. Exit code 0 is a necessary condition, not a sufficient one.
Remaining Blockers
Critical (must fix before release)
-
Classifier vocabulary – add
what owns,who owns,what depends on,what uses,what includesto the investigation intent trigger vocabulary.
Files:src/app/command_router.cppintent classification logic. -
Path penalty in filename ranker – penalize
scenarios/,data/,docs/,build/paths forfind struct/find classintent queries. Source directoriesinclude/andsrc/must rank first for declaration intents.
Files: filename ranking logic infind_service.cppor equivalent. -
Reference query ranking – for caller/reference queries, rank
.cppfiles above.hfiles. A forward declaration or defining header is not a call site.
Files: ranking logic infind_service.cpporexecution_engine.cpp.
High (should fix before release)
-
Git intent matching – extend git matchers to handle natural language wrappers:
what is the git diff,show me the status,what changed.
Files:command_router.cpp:1523git intent branch. -
Phrase normalization – tokenize multi-word phrases and search each token independently. Do not join with underscores unless the compound identifier appears verbatim in the codebase.
Files: symbol lookup / find_service query normalization.
Medium
- git status vs git log disambiguation – the bare query
git statusshould executegit status, notgit log.
Files:command_router.cpp:1425is_git_status_query().
Recommendation
Fix blockers in this order:
- Classifier vocabulary (B-02, B-03) – highest impact, lowest effort. These are vocabulary additions, not architectural changes.
- Path penalty (B-01) – prevents the most visible regression fixture bug.
- Reference ranking (B-06, B-07) – fixes the most common query pattern for architectural investigation.
- Git intent (B-05, B-08) – relatively contained, high user-facing impact.
- Phrase normalization (B-04) – most complex fix; may require tokenizer changes.
After fixes: re-run docs/telemetry/search_correctness_report.md against the full benchmark set. Target ≥ 90% semantic correctness and ≤ 20% grep fallback rate before re-evaluating release readiness.
This report reflects the state of the repository on 2026-06-26.
Every finding is backed by direct source investigation. No inferences from documentation.