Why Final-Answer Eval Fails for Agents

Traditional evaluation focuses on final outputs. For AI agents, this approach is dangerously incomplete. An agent can reach a correct final answer through a path that violates safety constraints, wastes resources, or reveals confidential information mid-trajectory. The journey matters as much as the destination.

Consider a customer service agent answering a refund question. It might provide the correct answer (customer is eligible for refund) but only after attempting to access the customer's payment history, accidentally logging internal product costs to the conversation, and trying five different knowledge base queries that exposed search algorithm patterns to the customer. A traditional final-answer evaluation would mark this as correct. A trajectory evaluation immediately flags multiple problems.

Safety in particular demands trajectory-level assessment. An agent recovering from errors gracefully is fundamentally safer than one plowing forward with wrong information. An agent recognizing its knowledge boundaries is better than one hallucinating tool calls. An agent using efficient search strategies leaves smaller attack surfaces than one that must call every available tool. These properties live in the trajectory.

Real-world deployments require trajectory visibility. When something goes wrong, you need to understand exactly what the agent did wrong, not just that the final answer was bad. Trajectory evaluation is therefore not optional for production-ready agent assessment.

The Trajectory Evaluation Framework

A trajectory is the complete sequence of steps an agent takes to solve a problem. Each trajectory consists of a series of (state, action, observation) triples: the agent is in a state, takes an action, observes the result, and transitions to a new state.

Formal Definition: A trajectory is a sequence T = [s₀, a₀, o₀, s₁, a₁, o₁, ..., sₙ, aₙ, oₙ] where sᵢ is the state (including context, previous observations, current goal), aᵢ is the action taken by the agent (e.g., "call_tool", "retrieve_document", "synthesize_answer"), and oᵢ is the observation resulting from that action.
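For concreteness, the definition can be sketched in code. The class and field names below are illustrative choices, not a standard API:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Step:
    """One (state, action, observation) triple in an agent trajectory."""
    state: dict[str, Any]   # context, previous observations, current goal
    action: str             # e.g. "call_tool", "retrieve_document"
    observation: Any        # result of taking the action

@dataclass
class Trajectory:
    """A complete sequence T = [(s0, a0, o0), ..., (sn, an, on)]."""
    steps: list[Step] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.steps)

    def actions(self) -> list[str]:
        return [s.action for s in self.steps]

# Example: a three-step trajectory for a refund question
t = Trajectory([
    Step({"goal": "refund eligibility"}, "retrieve_document", "policy.md"),
    Step({"goal": "refund eligibility"}, "call_tool", {"order_status": "delivered"}),
    Step({"goal": "refund eligibility"}, "synthesize_answer", "eligible"),
])
```

Everything the later sections measure (length, tool accuracy, safety, hallucinations) is a function over this sequence.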

Trajectory Length Distributions: Track how many steps different agent behaviors require. A well-designed agent might solve 80% of problems in 3-5 steps, 15% in 6-8 steps, and 5% in 9+ steps. If an agent consistently requires 15+ steps for tasks that should take 3-5, it is inefficient. If trajectory length varies wildly, the agent may lack consistent reasoning patterns.

Trajectory Replay and Simulation: Store complete trajectories including tool calls, API responses, and intermediate reasoning outputs. Being able to replay a trajectory offline is essential for debugging and post-hoc evaluation. This requires careful logging of: agent reasoning/thought processes, tool invocations and parameters, tool response payloads, state transitions, and final action rationale.
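A minimal logging sketch, assuming JSONL as the storage format (any structured store works; the field names are illustrative):

```python
import json
import time

def log_step(log_file, agent_id, step_index, reasoning, tool_name,
             tool_params, tool_response, state_snapshot):
    """Append one trajectory step as a JSON line for offline replay."""
    record = {
        "ts": time.time(),
        "agent_id": agent_id,
        "step": step_index,
        "reasoning": reasoning,        # agent thought process at this step
        "tool": tool_name,
        "params": tool_params,
        "response": tool_response,     # full payload, for replay fidelity
        "state": state_snapshot,
    }
    log_file.write(json.dumps(record) + "\n")

def replay(jsonl_lines):
    """Rebuild a trajectory from logged JSON lines, ordered by step index."""
    return sorted((json.loads(line) for line in jsonl_lines),
                  key=lambda r: r["step"])
```

Sorting on replay makes the log robust to out-of-order writes from concurrent tool calls.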

Trajectory vs. Execution Paths: Distinguish between logical trajectories (what reasoning steps the agent should follow) and actual execution paths (what the agent actually did). An agent might take action "look_up_inventory" but due to network latency or API changes, observe something unexpected. Evaluation should flag when the execution path diverges significantly from expected logical trajectories.

Step-Level Quality Scoring

Each action in the trajectory receives its own quality score. Aggregate step scores to produce trajectory quality. The key is scoring on dimensions that reveal different quality aspects.

Action Relevance: Does this action move toward solving the problem? Or is it tangential, redundant, or exploring dead ends? Relevance scoring (0-5 scale) identifies agents that generate lots of busywork. A search agent that queries the same knowledge base twice redundantly shows poor relevance on the second query.

Action Correctness: Did the action actually accomplish what the agent intended? This requires understanding the agent's stated goal for each action. An agent that intends to "retrieve pricing information" but calls a tool that returns product specifications has incorrect action selection, even if the tool invocation was syntactically valid.

Safety Score: Does this action violate security or privacy constraints? Check: does the action access sensitive data unnecessarily? Does it make assumptions about user permissions? Does it expose confidential information to the user? Safety violations at any step are serious flags. A single privacy breach can negate many otherwise-good actions.

Efficiency Metric: Is this the most direct way to accomplish the step's goal? An agent that needs to find a fact might search 5 different sources or 1. Both might find the answer eventually, but the efficient agent demonstrates better reasoning. Track whether redundant actions could have been avoided.

Aggregation Strategy: Don't just average step scores. A trajectory with 8 good steps and 1 catastrophic error (safety violation) should not score as "good". Use: weighted aggregation (critical safety steps weighted heavily), floor functions (one safety violation floors the entire trajectory score), or multi-metric reporting (report safety separately from efficiency so stakeholders see the trade-off).
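A sketch of weighted aggregation with a safety floor; the 0-5 scale and equal default weights are assumptions carried over from the scoring discussion above:

```python
def trajectory_score(step_scores, safety_violations, weights=None):
    """Aggregate per-step quality scores (0-5 each) into one trajectory score.

    Applies a floor function: any safety violation caps the whole
    trajectory at 0.0, no matter how good the other steps were.
    """
    if safety_violations > 0:
        return 0.0  # one violation floors the entire trajectory
    if weights is None:
        weights = [1.0] * len(step_scores)  # equal weights by default
    return sum(w * s for w, s in zip(weights, step_scores)) / sum(weights)
```

With this aggregation, a trajectory of eight perfect steps plus one safety violation scores 0.0 rather than "mostly good", which is exactly the property a plain average lacks.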

Tool Use Accuracy Evaluation

For agents that use external tools, tool invocation accuracy is a distinct evaluation dimension. This includes: selecting the right tool, calling it with correct parameters, and handling the response appropriately.

Tool Selection Accuracy: Did the agent call the right tool? Many agents have access to multiple tools with overlapping functionality. Using the wrong tool might still work (reading from a cache when you should query the database) but indicates shallow understanding. Measure: percentage of tool calls where the selected tool is optimal for the stated goal. Target: 90%+.

Parameter Accuracy: When the agent calls a tool, are the parameters correct? Examples: passing the right product ID (not a typo), formatting dates correctly, choosing correct filter values. Parameter errors are common in agentic systems—models understand which tool to use but make mistakes in parameter selection. Measure: percentage of tool calls where all parameters are correct or acceptable variations. Invalid parameters that cause tool failure should be caught as errors.

Parameter Ordering: For some tools, the order of parameters matters (calling search with the query first and filters second works; the reverse may not). Track whether agents consistently use correct parameter ordering or require multiple attempts. This suggests whether the agent has truly internalized tool semantics or is guessing.

Tool Sequence Optimization: A working sequence of tool calls might not be optimal. An agent that calls tool A, then B, then A again has non-optimal ordering. Compute: the edit distance between the agent's actual sequence and the oracle optimal sequence. Low edit distance indicates the agent understands workflow optimization; high distance suggests myopic reasoning.
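The edit distance here is standard Levenshtein distance applied to action sequences rather than characters; a minimal implementation:

```python
def sequence_edit_distance(actual, oracle):
    """Levenshtein edit distance between two tool-call sequences.

    0 means the agent matched the oracle exactly; higher values mean
    more insertions, deletions, or substitutions were needed.
    """
    m, n = len(actual), len(oracle)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if actual[i - 1] == oracle[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete an agent action
                          d[i][j - 1] + 1,        # insert a missing action
                          d[i - 1][j - 1] + cost) # substitute an action
    return d[m][n]
```

The A, B, A pattern from the text scores distance 1 against an oracle of A, B: one redundant call.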

Handling Tool Failures: When a tool call fails, does the agent recover sensibly? Measure: percentage of tool failures followed by appropriate recovery actions. An agent that calls a tool with invalid parameters, gets an error, and then correctly modifies parameters and retries shows good error recovery. One that ignores the error and continues is a problem.

Error Recovery Assessment

How an agent responds to failures reveals a lot about its reasoning quality. Some agents cascade errors (one mistake leads to worse mistakes downstream). Good agents recognize errors and recover.

Error Cascade Detection: Track whether errors in early steps contaminate later steps. If the agent retrieves wrong information in step 2 and then makes decisions based on that wrong information, it's error cascading. Detect this by comparing: optimal reasoning path given ground truth, vs. agent's reasoning path given what it actually observed. Agents that adapt reasoning based on actual observations cascade less; agents that follow predetermined scripts cascade more.

Graceful Degradation: When an agent encounters a problem it can't solve, does it continue trying, give up entirely, or admit limitations? Measure: at what point does an agent stop trying? Does it give up after 1 failed tool call or 10? The answer depends on context, but consistency is key. Graceful agents should recognize unsolvable problems quickly (after 3-5 reasonable attempts) and report limitations.

Error Explanation Quality: When the agent encounters an error (tool returns unexpected output), does it explain what went wrong? An agent that says "the search returned no results" shows understanding. One that doesn't acknowledge the error and continues shows poor reasoning transparency. Qualitatively evaluate agent explanations when errors occur.

Self-Correction: Can the agent correct its own mistakes? Measure: percentage of errors caught and corrected by the agent before proceeding. An agent that makes an error (calls tool with wrong parameters), notices the error in the response, and corrects it demonstrates self-correction capability. This is highly correlated with real-world success.
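Once each observed error is labeled as corrected or not (by automated checks or reviewers), the self-correction measure reduces to a simple ratio; a sketch:

```python
def recovery_rate(corrections):
    """corrections: one bool per observed error, True if the agent caught
    and fixed the error before proceeding.

    Returns errors_caught_and_corrected / total_errors.
    """
    if not corrections:
        return 1.0  # no errors occurred, nothing to recover from
    return sum(corrections) / len(corrections)
```

The convention of returning 1.0 for error-free trajectories is an assumption; some teams prefer to report "no errors observed" separately.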

Efficiency Evaluation

Efficiency matters both for user experience (faster response) and cost (fewer tool calls, less compute). Measure efficiency using multiple metrics.

Steps-to-Completion Ratio: How many steps did the agent take compared to the oracle optimal trajectory length? Compute: steps_taken / oracle_length. Ratio of 1.0 means optimal. Ratio of 2.0 means twice as many steps as necessary. For most domains, acceptable ratios are 1.0-1.5. Ratios above 2.0 indicate the agent is very inefficient.

Oracle Trajectory Estimation: You need an oracle trajectory (the optimal sequence of steps a perfect agent would take). This comes from: expert demonstrations (have a human solve the problem, record their steps), algorithm analysis (for deterministic problems, compute the minimal steps required), or crowd consensus (multiple humans solve; count minimum steps any human took).

Tool Call Cost Optimization: If tool calls have different costs, measure not just number of calls but cost per solution. An agent that needs 3 cheap calls vs. 1 expensive call might still be more efficient overall. Compute: total_cost / solutions_completed. Benchmark against simpler baselines.

Latency Distribution: Measure not just average solution time but the distribution. An agent with 95th percentile latency of 60 seconds is problematic even if median is 3 seconds. Report: min, 25th percentile, median, 75th percentile, 95th percentile, and max latency. Identify outliers requiring investigation.
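One way to compute the report, using nearest-rank percentiles on the raw sample (an assumption; interpolated percentiles work equally well):

```python
def latency_report(latencies_sec):
    """Summarize a latency sample at the percentiles listed above,
    using nearest-rank selection on the sorted data."""
    xs = sorted(latencies_sec)

    def pct(p):
        # index of the nearest rank for percentile p
        return xs[round(p / 100 * (len(xs) - 1))]

    return {"min": xs[0], "p25": pct(25), "median": pct(50),
            "p75": pct(75), "p95": pct(95), "max": xs[-1]}
```

A large gap between median and p95 is the signature of the outlier problem described above and warrants trajectory-level investigation of the slow cases.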

Safety at Each Step

Safety violations deserve special emphasis. A single serious violation can disqualify an agent from production deployment, regardless of how efficient or accurate it otherwise is.

Unsafe Intermediate Actions: An agent might reach a correct final answer through unsafe intermediate steps. Example: a customer service agent reaches the correct refund decision but only after accessing the customer's full purchase history (should have only checked their recent order). The final answer is right; the path was wrong.

Reversibility Assessment: Some actions are reversible (query a knowledge base—no permanent effects). Others are irreversible (transfer money, delete data). Track whether agents distinguish these categories appropriately. An agent that treats reversible and irreversible actions the same way is dangerous. Expect agents to be more cautious with irreversible actions (more validation before proceeding).

Unnecessary Data Access: Does the agent access data it doesn't need? An agent solving "what's the customer's account status?" should check account status, not pull billing history, contact information, and payment methods. Measure: what percentage of data accessed is actually used in reasoning? A high percentage indicates good data minimization; a low percentage indicates over-access.

Confidence Calibration in Uncertainty: When an agent doesn't know something, does it admit it? Or does it hallucinate answers? Measure: when evaluation reveals the agent produced wrong information, was there evidence the agent was uncertain but proceeded anyway? Agents that clearly state uncertainty are safer (humans can fact-check) than those that confidently generate false information.

Harmful Content Generation: Does the agent ever generate harmful content (biased statements, hate speech, illegal advice, manipulation)? This is a binary pass/fail criterion. Single incident = failure. Measure across trajectories: the percentage of trajectories containing no harmful content.

Hallucinated Actions and Phantom Tool Calls

Some agents generate action descriptions for tools that don't exist or with parameters that don't make sense. These hallucinations are critical failure modes requiring specific detection.

Non-Existent Tool Calls: Detect when an agent calls a tool that isn't in its available tools. Example: agent states "calling_deprecated_api_v1()" but that tool was never registered. This is a critical failure. Measure: percentage of trajectories containing calls to non-existent tools. Target: 0%.

Invalid Parameters: Some parameters are logically invalid. An agent calling "search_database(date='invalid')" has hallucinated. Detect by type checking, range checking, and format validation. An agent calling "search_database(limit=-5)" is requesting negative results (invalid). Measure: what percentage of tool calls have invalid parameters that would cause failure? This is slightly different from parameter accuracy—it's specifically about logical invalidity.
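Both failure modes can be caught automatically with a registry of known tools and per-parameter validators. The registry shape below is an assumption, not a standard schema:

```python
def validate_tool_call(tool_name, params, registry):
    """Check a tool call against a registry of known tools.

    `registry` maps tool name -> {param_name: (expected_type, validator)},
    where validator is a predicate or None. Returns a list of problems;
    an empty list means the call is valid. Catches both non-existent
    tools and logically invalid parameters.
    """
    if tool_name not in registry:
        return [f"non-existent tool: {tool_name}"]  # hallucinated tool call
    problems = []
    spec = registry[tool_name]
    for name, value in params.items():
        if name not in spec:
            problems.append(f"unknown parameter: {name}")
            continue
        expected_type, check = spec[name]
        if not isinstance(value, expected_type):
            problems.append(f"{name}: wrong type")
        elif check is not None and not check(value):
            problems.append(f"{name}: invalid value")  # e.g. limit=-5
    return problems

# Hypothetical registry for a search tool
registry = {"search_database": {"limit": (int, lambda v: v > 0),
                                "date": (str, None)}}
```

Run this check over every logged tool call; the non-existent-tool rate and invalid-parameter rate fall out as simple counts.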

Phantom Intermediate States: Some agents generate descriptions of states that never actually occurred. Example: agent says "I checked the database and found 50 matching customers" but the database call was never actually made. This indicates the agent is hallucinating reasoning rather than reflecting actual execution. Detect by comparing agent-stated observations against actual tool outputs. Measure: percentage of agent-stated observations that match actual tool results.

Reasoning Hallucination vs. Tool Hallucination: Distinguish between: (1) hallucinated reasoning (agent imagines facts that weren't in tool outputs), (2) hallucinated tool calls (agent invokes tools that don't exist). Both are serious but require different mitigations. Report separately so you can understand the failure modes.

Human-in-Loop Trajectory Review

Automated metrics catch some issues but miss nuanced problems. Strategic human review of trajectories is essential.

Review Protocol Design: Have expert reviewers evaluate 50-100 random agent trajectories. Provide a structured form: (1) Did the agent solve the task correctly? (Yes/No), (2) Were there any safety or ethical concerns? (Yes/No, describe), (3) Was the approach reasonable? (1-5 scale), (4) Any hallucinations or errors? (Yes/No, describe), (5) What should improve? (open text).

Reviewer Expertise Requirements: Reviewers should be domain experts. For a customer service agent, reviewers should be experienced customer service representatives who understand policies and edge cases. For a medical diagnostic agent, reviewers should be clinicians. Generic review is insufficient.

Focus Areas for Reviewers: Rather than reviewing entire trajectories, focus attention on: (1) trajectories where automated metrics are borderline, (2) trajectories involving sensitive decisions (financial, medical, privacy), (3) trajectories that took significantly more steps than expected, (4) a random sample to catch issues automated metrics miss.

Adjudication Protocol: When reviewers disagree, have a more senior expert resolve disagreement. Track disagreement rate as a signal of evaluation clarity—high disagreement suggests the criteria are ambiguous. Kappa statistics work here too: measure inter-reviewer agreement.
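Cohen's kappa can be computed directly from two reviewers' labels; a minimal sketch for the two-rater case:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers' categorical ratings.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is chance agreement from each reviewer's label frequencies.
    """
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    ca, cb = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(ca) | set(cb))
    if p_e == 1.0:
        return 1.0  # degenerate case: both reviewers always pick one label
    return (p_o - p_e) / (1 - p_e)
```

As a rough reading, kappa above ~0.6 indicates substantial agreement; values near 0 mean reviewers agree no more than chance, a strong signal that the review criteria need tightening.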

Building Agent Eval Datasets

High-quality trajectory evaluation requires good datasets of test scenarios with known ground truth.

Real Interaction Capture: The gold standard is recording actual agent interactions during live use (with appropriate privacy safeguards). These trajectories capture real difficulty and edge cases better than synthetic scenarios. If your agent is deployed in limited capacity, capture these interactions for evaluation. They're invaluable.

Synthetic Scenario Generation: For new agents pre-deployment, you can't capture real interactions. Generate realistic scenarios: identify common use cases, edge cases, error conditions, and malicious inputs. For each scenario, document: the user query/request, the information the agent should have access to, the correct solution, difficulty level (easy/medium/hard), and potential pitfalls.

Trajectory Annotation: Once you have scenarios, have expert humans (domain experts, not just crowdworkers) solve them. Record and annotate their trajectories: at each step, what action did they take? Why? Was there a better alternative? What were they trying to accomplish? These expert trajectories become your oracle for comparison.

Dataset Size and Composition: For reliable evaluation, you need 100-200 test scenarios minimum. Compose them as: 50% common cases (typical user requests), 30% edge cases (boundary conditions, unusual but valid requests), 15% error cases (malformed requests, missing information), 5% adversarial cases (attempting to manipulate the agent into harmful behavior).
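The mix translates into concrete scenario counts; the rounding rule below (absorbing rounding drift into common cases) is one reasonable choice, not a requirement:

```python
def composition_targets(n_scenarios):
    """Split a scenario budget by the mix described above:
    50% common, 30% edge, 15% error, 5% adversarial."""
    mix = {"common": 0.50, "edge": 0.30, "error": 0.15, "adversarial": 0.05}
    counts = {k: round(n_scenarios * p) for k, p in mix.items()}
    # rounding can drift off-budget; absorb the difference into common cases
    counts["common"] += n_scenarios - sum(counts.values())
    return counts
```

For a 200-scenario suite this yields 100 common, 60 edge, 30 error, and 10 adversarial cases.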

Trajectory Storage and Logging: Design logging infrastructure to capture complete trajectories: store all tool calls with parameters and responses, agent reasoning/internal monologues, intermediate state, final answer, latency metrics, and error conditions. Use structured logging (JSON) for programmatic analysis. Ensure logs are stored securely—they may contain sensitive user data.

At a glance: 10 evaluation dimensions · 150+ typical test scenarios · 4.2 avg steps per solution · 92% tool accuracy · 3.2% hallucination rate · 0.68 rater agreement (κ)
Safety is Non-Negotiable

Don't accept agent designs that optimize for accuracy at the cost of safety. An agent that's 95% accurate but takes unsafe intermediate steps is undeployable. Make safety a prerequisite: agents must pass safety thresholds before accuracy is considered.

Log Everything for Debugging

When an agent goes wrong in production, you'll need to understand the complete trajectory. Invest in comprehensive logging from day one. It costs storage but saves enormous debugging time and reveals systemic failure modes that test set evaluation misses.

Trajectory Replay is Gold

Build systems that can replay agent trajectories. This enables offline analysis, debugging, and retraining. An agent trajectory that's frozen in time lets you understand exactly what happened and why—invaluable for improving the agent and debugging production issues.

Trajectory Evaluation Rubric

Dimension | What It Measures | Evaluation Method | Scoring
Action Relevance | Does each action move toward goal completion? | Expert trajectory review | % relevant actions
Action Correctness | Does each action accomplish its intended purpose? | Compare stated intention vs. actual outcome | % correct actions
Tool Accuracy | Correct tool selection and parameter usage | Automated checks + expert validation | % correct tool calls
Safety Score | Absence of safety/privacy violations | Automated rules + expert review | Pass/fail or severity count
Error Recovery | How well the agent recovers from mistakes | Trajectory analysis for error patterns | Recovery rate (%)
Efficiency | Steps to completion vs. oracle | Trajectory length vs. expected | Efficiency ratio (1.0 = optimal)
Hallucination Rate | False claims or phantom actions | Compare agent-stated observations vs. actual tool outputs | % trajectories with hallucinations
Final Correctness | Is the final answer/solution correct? | Match against ground truth | % correct solutions

Tool Use Accuracy Scoring Matrix

Tool Call Aspect | Correct | Acceptable Variation | Incorrect
Tool Selection | Tool is the optimal choice | Tool works but is suboptimal | Wrong tool or non-existent tool
Parameters | All params correct and optimal | Correct but non-optimal (e.g., empty filters) | Invalid or missing required params
Response Handling | Agent uses response appropriately | Agent uses response but not fully | Agent ignores or misinterprets response
Error Handling | Tool error triggers recovery | Tool error handled but suboptimally | Tool error ignored, or agent crashes
Sequencing | Sequence is logical and optimal | Sequence works but has redundancies | Sequence illogical or inefficient

Efficiency Metric Formulas

# Steps-to-Completion Efficiency Ratio
efficiency_ratio = actual_steps / oracle_optimal_steps
# acceptable range: 1.0-1.5; poor efficiency: > 2.0

# Cost-Adjusted Efficiency (when tool calls have different costs)
cost_efficiency = total_cost / successful_solutions
# benchmark against simpler baselines

# Trajectory Length Distribution
# report the 25th, 50th (median), 75th, and 95th percentiles of steps taken

# Hallucination Rate
hallucination_rate = trajectories_with_false_claims / total_trajectories
# unacceptable threshold: > 5%

# Error Recovery Rate
recovery_rate = errors_caught_and_corrected / total_errors_in_trajectories
# target threshold: > 70%

Key Takeaways

  • The Path Matters: Final-answer evaluation misses critical issues. Trajectory evaluation reveals how agents reason, handle errors, and maintain safety.
  • Multi-Dimensional Assessment: Score on action relevance, correctness, tool accuracy, safety, efficiency, and error recovery. No single metric suffices.
  • Safety First: Safety violations at any step should be disqualifying. Build safety checks into evaluation infrastructure.
  • Tool Use is Learnable: Good tool use accuracy (90%+) is achievable. Track this specifically—it's a good proxy for agent quality.
  • Efficiency Matters: Agents that waste tool calls or take unnecessary steps are expensive in production. Measure efficiency ratios relative to oracle paths.
  • Humans Catch What Automation Misses: Strategic human review of trajectories catches edge cases, safety issues, and subtle hallucinations. Allocate budget for expert review.
  • Log Everything: Complete trajectory logging is essential for debugging, improving, and understanding failure modes in production.

Implement Trajectory Evaluation Now

Start with a simple trajectory logger that captures agent actions and observations. Build automated checks for safety and hallucinations. Add expert human review for edge cases. This foundation scales to production-level evaluation.
