Human Evaluation Package – Planner Investigation Quality

Purpose

Evaluate whether the planner investigates like a senior engineer.

The evaluator should be someone unfamiliar with this repository.

Instructions

Run each question through cursor-agent and observe the output.
Evaluate each answer against the criteria below.
Record scores in the scoring sheet.
The planner passes if ≥16/20 questions score ≥3 on correctness.

How to Run

# Run a single question
./build/bin/cursor-agent --json "<question>" 2>/dev/null

# Or use the inspect command in interactive mode:
# Type the question, then press 'i' when prompted

20 Architecture Questions

Q1: Initialization Order

Question: Find the initialization order in Agent::run() to see why logging is set up before the main interaction loop

Expected answer: Agent::run() in src/agent.cpp:27-33 creates UIManager first, then ReplayService, then CommandRouter, then Session. UIManager is created first because log output infrastructure is needed before any commands can be processed. Session::run() is the last step and enters the interaction loop.

Evidence to find: src/agent.cpp:27-33

Q2: Diagnostics Isolation

Question: Find how the diagnostics module is separated from the execution engine in the source code

Expected answer: Diagnostics lives in src/diagnostics/diagnostics.cpp as a standalone module. It creates its own ExecutionEngine instance and tool runners. It does NOT share state with the main Agent – it’s an independent verification harness that simulates tool calls rather than using the real ones.

Evidence to find: src/diagnostics/diagnostics.cpp, the run_query() function

Q3: Lifecycle Shutdown

Question: Find the component responsible for lifecycle shutdown in the codebase

Expected answer: There is no explicit shutdown sequence. Session::run() uses while(true) and returns normally when the user types “exit” or “quit”. Agent objects are unwound by C++ destructor order. No dedicated shutdown component exists – the architecture relies on RAII and scope-based cleanup.
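
For evaluators less familiar with RAII, the scope-based cleanup described above can be illustrated with a minimal sketch. The Component type and its names are hypothetical and not taken from the repository; the sketch only shows that locals are destroyed in reverse order of construction when the enclosing scope ends.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for UIManager, ReplayService, and similar objects.
// It appends its name to a shared log when it is destroyed.
struct Component {
    std::string name;
    std::vector<std::string>& log;
    Component(std::string n, std::vector<std::string>& l)
        : name(std::move(n)), log(l) {}
    ~Component() { log.push_back(name); }
};

// Mimics a run() function without an explicit shutdown step:
// leaving the scope is the shutdown sequence.
std::vector<std::string> run_and_record_teardown() {
    std::vector<std::string> log;
    {
        Component ui("ui", log);
        Component replay("replay", log);
        Component router("router", log);
        // ... the interaction loop would run here and return normally ...
    }  // destructors run here in reverse order: router, replay, ui
    return log;
}
```

A planner answer that mentions destructor order or scope exit, rather than searching for a shutdown() function, matches this pattern.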

Q4: Command Routing

Question: Trace how a user command reaches the execution engine from the terminal prompt

Expected answer: The path is: main() → Agent::run() → Session::run() → read_prompt() → CommandRouter::process_user_input(). Inside process_user_input, the classification ladder runs: shell mode check → @ injection → ! shell → / meta → direct command → NL mapping → ExecutionEngine::execute(). The engine creates a fresh instance, runs evidence collection, and returns ExecutionResult.

Evidence to find: session.cpp, command_router.cpp, execution_engine.cpp

Q5: Tool Exhaustion

Question: Find what happens when the execution engine exhausts all its tool calls but still lacks evidence

Expected answer: When select_next_tool() returns empty (all tools are exhausted), select_recovery_tool() is called if recovery_count < 3. It chooses a strategy based on evidence state (e.g., find failed → try grep; grep found results → try read). If recovery also fails or is exhausted, the loop exits with stopped_early = true.

Evidence to find: execution_engine.cpp:1150-1157 (recovery on tool exhaustion)
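
The control flow in this expected answer can be sketched as follows. This is an illustration under assumptions, not the code in execution_engine.cpp: run_loop, LoopResult, kMaxRecovery, and the callback parameters are hypothetical names, and only the cap of 3 and the stopped_early flag come from the answer above.

```cpp
#include <optional>
#include <string>
#include <vector>

// Assumed cap: the expected answer says recovery runs while recovery_count < 3.
constexpr int kMaxRecovery = 3;

struct LoopResult {
    std::vector<std::string> tools_run;
    int recovery_count = 0;
    bool stopped_early = false;
};

// Hypothetical driver. `planned` plays the role of select_next_tool() and
// `recovery` the role of select_recovery_tool(); an empty optional means
// "nothing left to try". `has_evidence` stands in for the real stop check.
template <typename Planned, typename Recovery, typename HasEvidence>
LoopResult run_loop(Planned planned, Recovery recovery, HasEvidence has_evidence) {
    LoopResult r;
    while (!has_evidence(r)) {
        std::optional<std::string> tool = planned(r);
        if (!tool) {
            // Planned tools are exhausted: fall back to recovery while under the cap.
            if (r.recovery_count < kMaxRecovery) {
                tool = recovery(r);
                ++r.recovery_count;
            }
            if (!tool) {
                // Recovery is exhausted or produced nothing.
                r.stopped_early = true;
                break;
            }
        }
        r.tools_run.push_back(*tool);
    }
    return r;
}
```

With one planned tool and a recovery callback that always proposes another tool, this loop runs four tools in total and then stops early, which is the behaviour the evaluator should expect the planner to describe.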

Q6: ReplayService Separation

Question: Find how the ReplayService is separated from the Agent class in the codebase

Expected answer: ReplayService is created separately in Agent::run() and passed to Session and CommandRouter as a pointer. It’s never stored inside Agent. It logs SessionState snapshots before and after each command to ~/.cursor/replay/. The separation means replay can be disabled by passing nullptr.

Evidence to find: agent.cpp:27-33, replay_service.h/cpp
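
A minimal sketch of the "pass a pointer, nullptr disables it" arrangement described above. The class bodies are simplified assumptions: this ReplayService keeps snapshot labels in memory instead of writing to ~/.cursor/replay/, and the record() and process() methods are hypothetical names.

```cpp
#include <string>
#include <vector>

// Simplified stand-in: stores labels in memory rather than on disk.
struct ReplayService {
    std::vector<std::string> snapshots;
    void record(const std::string& label) { snapshots.push_back(label); }
};

class CommandRouter {
public:
    // The router does not own the service; nullptr means replay is disabled.
    explicit CommandRouter(ReplayService* replay) : replay_(replay) {}

    void process(const std::string& cmd) {
        if (replay_) replay_->record("before:" + cmd);
        // ... command handling would go here ...
        if (replay_) replay_->record("after:" + cmd);
    }

private:
    ReplayService* replay_;
};
```

The design choice to look for in the planner's answer is that the consumer holds a non-owning pointer, so replay can be switched off without changing the consumer.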

Q7: Confidence Gating

Question: Find where the confidence gating prevents the AI from answering without sufficient evidence

Expected answer: In command_router.cpp, should_call_ai(result) checks result.outcome. Only Success outcome allows AI call. Non-Success outcomes produce deterministic messages directly. Additionally, execution_engine.cpp gates on combined confidence < 0.2 triggering early stop and < 0.7 triggering post-completion recovery.

Evidence to find: command_router.cpp (should_call_ai), execution_engine.cpp (confidence checks)
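
The two gates in this answer can be sketched as below. The thresholds 0.2 and 0.7 and the Success-only rule come from the expected answer; the struct layout, the Failed outcome, the GateAction enum, and the gate() function are hypothetical and only serve the illustration.

```cpp
// InsufficientEvidence appears elsewhere in this package; Failed is a placeholder.
enum class Outcome { Success, InsufficientEvidence, Failed };

// Simplified stand-in for the real ExecutionResult.
struct ExecutionResult {
    Outcome outcome;
    double confidence;
};

// Router-side gate: only a Success outcome is handed to the AI.
bool should_call_ai(const ExecutionResult& r) {
    return r.outcome == Outcome::Success;
}

constexpr double kEarlyStopThreshold = 0.2;  // below this: stop early
constexpr double kRecoveryThreshold = 0.7;   // below this: post-completion recovery

enum class GateAction { StopEarly, RecoverAfterCompletion, Proceed };

// Engine-side gate on combined confidence.
GateAction gate(double combined_confidence) {
    if (combined_confidence < kEarlyStopThreshold) return GateAction::StopEarly;
    if (combined_confidence < kRecoveryThreshold) return GateAction::RecoverAfterCompletion;
    return GateAction::Proceed;
}
```

A strong planner answer names both gates: the outcome check in the router and the numeric thresholds in the engine.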

Q8: InvestigationSession

Question: Find the InvestigationSession struct and trace how it bridges execution to the user

Expected answer: InvestigationSession is defined in include/core/investigation_session.h. It’s created via from_result() which maps ExecutionResult fields (tools, files, evidence, confidence, outcome) into the struct. It’s stored in SessionState::last_investigation. The /inspect command reads this field and displays it.

Evidence to find: investigation_session.h, command_router.cpp (handle_inspect_command)

Q9: Deterministic vs LLM Classification

Question: Find where the planner supports both deterministic and LLM-based classification paths

Expected answer: ExecutionEngine has two classification modes, Deterministic and LLM, selected by classifier_mode_. select_next_tool() either runs its deterministic logic or dispatches to select_next_tool_llm(), and classify_goal() likewise either runs its deterministic logic or dispatches to classify_goal_llm(). The deterministic mode uses regex/heuristic matching; the LLM mode calls AIService.

Evidence to find: execution_engine.h, execution_engine.cpp (classify_goal, select_next_tool)
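
A minimal sketch of mode-based dispatch, assuming a two-value mode enum. Only classifier_mode_, classify_goal(), and classify_goal_llm() are names from the expected answer; the Engine class, the classify_goal_deterministic() helper, the keyword heuristic, and the placeholder LLM return value are hypothetical.

```cpp
#include <string>

enum class ClassifierMode { Deterministic, LLM };

class Engine {
public:
    explicit Engine(ClassifierMode mode) : classifier_mode_(mode) {}

    // Public entry point: picks the path based on the configured mode.
    std::string classify_goal(const std::string& query) const {
        return classifier_mode_ == ClassifierMode::LLM
                   ? classify_goal_llm(query)
                   : classify_goal_deterministic(query);
    }

private:
    // Hypothetical heuristic: a keyword match stands in for regex rules.
    std::string classify_goal_deterministic(const std::string& query) const {
        return query.find("architecture") != std::string::npos
                   ? "ArchitectureReview"
                   : "CodebaseQuery";
    }

    // Placeholder: the real path would call an AI service.
    std::string classify_goal_llm(const std::string&) const {
        return "llm:unimplemented";
    }

    ClassifierMode classifier_mode_;
};
```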

Q10: Recovery Loop Prevention

Question: Find how the recovery loop in select_recovery_tool() prevents infinite loops

Expected answer: Three mechanisms: (1) MAX_RECOVERY = 3 limit in execute(), (2) seen_tool_calls deduplication prevents running the same tool+args twice, (3) recovery strategies are evidence-driven – they only fire when specific evidence facts are missing, so they naturally terminate when all evidence gaps are filled.

Evidence to find: execution_engine.cpp

Q11: Classification Ladder

Question: Find how the CommandRouter determines whether a query goes to the engine vs a direct handler

Expected answer: The classification ladder in process_user_input() is a strict priority chain: (1) shell mode active → shell handler, (2) contains @ → file injection, (3) starts with ! → shell escape, (4) starts with / → meta command, (5) direct command match → direct handler, (6) NL-to-direct mapping → direct handler, (7) falls through to ExecutionEngine. First match wins.

Evidence to find: command_router.cpp (process_user_input)
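
The first-match-wins ladder can be sketched as a chain of early returns. The order of checks follows the expected answer; the Route enum, classify(), and the two predicate helpers with their example commands are hypothetical.

```cpp
#include <string>

enum class Route { Shell, FileInjection, ShellEscape, MetaCommand, DirectHandler, Engine };

// Hypothetical predicates standing in for the real direct-command and
// natural-language mapping checks.
bool is_direct_command(const std::string& s) { return s == "help" || s == "clear"; }
bool maps_to_direct(const std::string& s) { return s == "show help"; }

// Strict priority chain: the first matching rule decides the route.
Route classify(const std::string& input, bool shell_mode_active) {
    if (shell_mode_active) return Route::Shell;                              // (1)
    if (input.find('@') != std::string::npos) return Route::FileInjection;  // (2)
    if (!input.empty() && input[0] == '!') return Route::ShellEscape;       // (3)
    if (!input.empty() && input[0] == '/') return Route::MetaCommand;       // (4)
    if (is_direct_command(input)) return Route::DirectHandler;              // (5)
    if (maps_to_direct(input)) return Route::DirectHandler;                 // (6)
    return Route::Engine;                                                   // (7)
}
```

Because the chain is ordered, an input such as "!cat @notes" is routed as file injection under this sketch, since the @ check precedes the ! check.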

Q12: Discovery-Planning Relationship

Question: Find the relationship between discovery, planning, and the task pipeline in the source code

Expected answer: DiscoveryService::scan() detects project type, CI, relevant files. PlanningService::generate_plan() takes the discovery result and user query to create a structured task list. handle_task_with_planning() in command_router.cpp orchestrates: discovery → plan generation → user approval → execution → evidence collection. Discovery feeds the plan; the plan feeds the task pipeline.

Evidence to find: discovery_service.cpp, planning_service.cpp, command_router.cpp

Q13: Metrics Population

Question: Trace how RecoveryMetrics, TrustMetrics, and RetrievalMetrics get populated after execution

Expected answer: RecoveryMetrics is computed in ExecutionEngine::execute() lines ~1500-1520 (tool attempt counts, evidence found, confidence delta). TrustMetrics is populated in CommandRouter::process_user_input() (plan_approved, diff_approved from user interaction). RetrievalMetrics is populated in CommandRouter::handle_codebase_query() (filename/symbol/directory/grep hits). All are stored in SessionState after each command.

Evidence to find: execution_engine.cpp, command_router.cpp, metrics.h

Q14: Inspect Command

Question: Find how the inspect command (/inspect) retrieves and displays the last investigation session

Expected answer: handle_inspect_command() in command_router.cpp reads agent_.state_.last_investigation (an optional<InvestigationSession>). If absent, it prints “No investigation data”. If present, it formats the struct fields (goal, conclusion, confidence, tools_used, files_examined, symbols_found, evidence_summary) into a human-readable report. In addition, after each answer Session::run() opens a 3-second window during which pressing ‘i’ triggers inspect.

Evidence to find: command_router.cpp (handle_inspect_command), session.cpp (i key handler)
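
The optional-based lookup can be sketched as follows. The struct here is a reduced, assumed version of InvestigationSession with only four of the listed fields, and render_inspect() and its output layout are hypothetical; only the “No investigation data” message comes from the expected answer.

```cpp
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Reduced stand-in for the real struct.
struct InvestigationSession {
    std::string goal;
    std::string conclusion;
    double confidence = 0.0;
    std::vector<std::string> tools_used;
};

// Formats the last investigation, or reports that none exists.
std::string render_inspect(const std::optional<InvestigationSession>& last) {
    if (!last) return "No investigation data";
    std::ostringstream out;
    out << "Goal: " << last->goal << '\n'
        << "Conclusion: " << last->conclusion << '\n'
        << "Confidence: " << last->confidence << '\n'
        << "Tools used:";
    for (const auto& tool : last->tools_used) out << ' ' << tool;
    return out.str();
}
```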

Q15: EvidenceCollector

Question: Find the role of EvidenceCollector and when it is triggered in the codebase

Expected answer: EvidenceCollector is a private class inside command_router.cpp. It’s used in handle_task_with_planning() after AI execution. It lazily collects per-task evidence: runs cmake build → ctest → git diff. Results are cached per file path. It provides has_collected(), build_evidence(), test_evidence(), and diff_evidence() accessors. It’s only triggered in the task pipeline path, not during regular engine execution.

Evidence to find: command_router.cpp (EvidenceCollector class, handle_task_with_planning)
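
Lazy, per-path caching of evidence can be sketched as below. This is not the EvidenceCollector from command_router.cpp: the LazyEvidence class, its Runner callback, and the single evidence() accessor are hypothetical simplifications, and the callback replaces the real cmake, ctest, and git diff invocations.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

class LazyEvidence {
public:
    // The runner stands in for the expensive build/test/diff step.
    using Runner = std::function<std::string(const std::string&)>;

    explicit LazyEvidence(Runner runner) : run_(std::move(runner)) {}

    bool has_collected(const std::string& path) const {
        return cache_.count(path) > 0;
    }

    // Runs the collector on first access only; later calls hit the cache.
    const std::string& evidence(const std::string& path) {
        auto it = cache_.find(path);
        if (it == cache_.end()) {
            it = cache_.emplace(path, run_(path)).first;
        }
        return it->second;
    }

private:
    Runner run_;
    std::map<std::string, std::string> cache_;
};
```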

Q16: Recovery Strategies

Question: Find the recovery strategies in select_recovery_tool() and the order they are evaluated

Expected answer: Five strategies evaluated in order: (1) find:noresults + no grep → grep, (2) grep:results + no read → read, (3) find:results + read + no grep → grep, (4) no evidence + no discovery → discovery, (5) header examined + no impl → find –impl, or .cpp examined + no header → find header.

Evidence to find: execution_engine.cpp (select_recovery_tool)
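
The ordered evaluation can be sketched as a sequence of guarded returns. The rule order follows the expected answer; the EvidenceState flags, the pick_recovery() function, and the returned tool labels are hypothetical, and the real select_recovery_tool() may track evidence differently.

```cpp
#include <optional>
#include <string>

// Hypothetical summary of what the engine has tried and found so far.
struct EvidenceState {
    bool find_ran = false;
    bool find_had_results = false;
    bool grep_ran = false;
    bool grep_had_results = false;
    bool read_ran = false;
    bool discovery_ran = false;
    bool header_examined = false;
    bool impl_examined = false;
    bool any_evidence = false;
};

// Returns the first applicable recovery tool, or nothing if no rule fires.
std::optional<std::string> pick_recovery(const EvidenceState& e) {
    if (e.find_ran && !e.find_had_results && !e.grep_ran)
        return std::string("grep");         // (1)
    if (e.grep_ran && e.grep_had_results && !e.read_ran)
        return std::string("read");         // (2)
    if (e.find_ran && e.find_had_results && e.read_ran && !e.grep_ran)
        return std::string("grep");         // (3)
    if (!e.any_evidence && !e.discovery_ran)
        return std::string("discovery");    // (4)
    if (e.header_examined && !e.impl_examined)
        return std::string("find_impl");    // (5), header side
    if (e.impl_examined && !e.header_examined)
        return std::string("find_header");  // (5), implementation side
    return std::nullopt;
}
```

Because each rule depends on a missing piece of evidence, the function returns nothing once the gaps are filled or every tool has been tried.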

Q17: Confidence Combination

Question: Find how the confidence service computes combined confidence across multiple tools

Expected answer: Each tool call produces a ConfidenceResult with score and reason. ConfidenceService::combine() in confidence_service.cpp aggregates these: takes the average of all tool scores, then applies a dampening factor (0.9) for each tool beyond the first to prevent runaway confidence from many weak tools.

Evidence to find: confidence_service.cpp
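
One reading of the rule above, sketched as arithmetic: take the mean of the tool scores and multiply by 0.9 once for each tool beyond the first, i.e. mean × 0.9^(n−1). The combine() function below is an assumption based on that reading, not the code in confidence_service.cpp, and the zero result for an empty input is also assumed.

```cpp
#include <cmath>
#include <vector>

// Mean of the scores, dampened by 0.9 for every tool after the first.
double combine(const std::vector<double>& scores) {
    if (scores.empty()) return 0.0;  // assumed behaviour for no tool results
    double sum = 0.0;
    for (double s : scores) sum += s;
    const double mean = sum / static_cast<double>(scores.size());
    const double extra_tools = static_cast<double>(scores.size() - 1);
    return mean * std::pow(0.9, extra_tools);
}
```

Under this reading, two tools scoring 0.8 and 0.6 give a mean of 0.7 and a combined confidence of 0.7 × 0.9 = 0.63, while a single tool scoring 0.8 is left at 0.8.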

Q18: Zero Results Recovery

Question: Find what happens when both find and grep return no results for a codebase query

Expected answer: Strategy 1 fires: find returns noresults → grep is attempted. If grep also returns noresults → strategy 4 fires: discovery is attempted (provides project structure, source counts, CI info). If all recovery strategies are exhausted, the loop exits with stopped_early = true and outcome = InsufficientEvidence. Confidence is low (< 0.2), and because the outcome is not Success, should_call_ai() returns false.

Evidence to find: execution_engine.cpp (select_recovery_tool, should_stop check)

Q19: ArchitectureReview vs CodebaseQuery

Question: Find how the ArchitectureReview goal type differs from CodebaseQuery in tool selection

Expected answer: CodebaseQuery uses: find (or references for callers) → read → grep. ArchitectureReview uses: discovery → git log → multi-step grep (AgentMode, MODE_, AuthProvider, strategy_changes) → multi-file read (session_state.h, metrics.h, execution_engine.cpp) → read tests/validation_runner.cpp. ArchitectureReview is a fixed script; CodebaseQuery adapts to the query.

Evidence to find: execution_engine.cpp (select_next_tool for each goal type)

Q20: Tool Deduplication

Question: Find how tool deduplication works in the execution engine and when it triggers recovery

Expected answer: Each tool+args pair is serialized to a tc_signature string. Before executing a tool, the engine checks seen_tool_calls set. If the signature exists, the tool is skipped and recovery is attempted (up to 3 times). If recovery_count >= 3 and deduplication fires, stopped_early = true and the loop exits. This prevents the engine from repeating the same failed search.

Evidence to find: execution_engine.cpp (seen_tool_calls logic)
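
Signature-based deduplication can be sketched with a set of strings. The names tc_signature and seen_tool_calls come from the expected answer; the serialization format (a unit-separator byte between fields), the Deduper struct, and the admit() method are hypothetical.

```cpp
#include <set>
#include <string>
#include <vector>

// Serializes a tool call into one string. A separator byte between fields
// keeps ("a", {"b", "c"}) distinct from ("a", {"bc"}).
std::string tc_signature(const std::string& tool, const std::vector<std::string>& args) {
    std::string sig = tool;
    for (const auto& arg : args) {
        sig += '\x1f';
        sig += arg;
    }
    return sig;
}

struct Deduper {
    std::set<std::string> seen_tool_calls;

    // Returns true if the call is new and should run, false if it is a repeat.
    bool admit(const std::string& tool, const std::vector<std::string>& args) {
        return seen_tool_calls.insert(tc_signature(tool, args)).second;
    }
};
```

In the flow described above, a false result from a check like admit() is the point at which the engine would skip the tool and attempt recovery.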

Evaluation Rubric

Each question is scored 1-5:

Score	Label	Description
5	Excellent	Answer is correct AND cites specific files/lines. Investigation path is clear.
4	Good	Answer is correct. May be slightly vague on exact file/line references.
3	Adequate	Answer is directionally correct. Investigation found relevant code but missed nuances.
2	Poor	Answer is partially wrong or misses key evidence. Investigation path was confused.
1	Failing	Answer is wrong. Investigation failed or never ran tools.

Pass threshold: ≥16/20 questions must score ≥3.
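
The pass rule can be checked mechanically once the sheet is filled in. The sketch below assumes the correctness scores are entered as 20 integers; the Verdict struct and evaluate() function are hypothetical helpers, not part of the repository.

```cpp
#include <algorithm>
#include <vector>

struct Verdict {
    int passing;  // number of questions with correctness >= 3
    bool pass;    // true when at least 16 of 20 questions pass
};

// Applies the threshold above. A sheet without exactly 20 scores is
// treated as incomplete and does not pass (an assumption of this sketch).
Verdict evaluate(const std::vector<int>& correctness) {
    const int passing = static_cast<int>(
        std::count_if(correctness.begin(), correctness.end(),
                      [](int score) { return score >= 3; }));
    return {passing, correctness.size() == 20 && passing >= 16};
}
```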

Scoring Sheet

Q#	Question	Correctness (1-5)	Investigation Clear? (Y/N)	/inspect Useful? (Y/N)	Recovery Visible? (Y/N)	Notes
1	Initialization Order
2	Diagnostics Isolation
3	Lifecycle Shutdown
4	Command Routing
5	Tool Exhaustion
6	ReplayService Separation
7	Confidence Gating
8	InvestigationSession
9	Deterministic vs LLM
10	Recovery Loop Prevention
11	Classification Ladder
12	Discovery-Planning
13	Metrics Population
14	Inspect Command
15	EvidenceCollector
16	Recovery Strategies
17	Confidence Combination
18	Zero Results Recovery
19	ArchitectureReview
20	Tool Deduplication

Summary:

Total questions scoring ≥3: ___ / 20
Average correctness: ___
Investigation clear: ___ / 20
/inspect useful: ___ / 20
Recovery visible: ___ / 20

Recommendation (circle one): PASS / FAIL / CONDITIONAL

Evaluator notes: _______________________ _________________________ ___________________________