Introduction: The Evaluation Paradox
You've built an LLM application. Now comes the question that haunts every AI practitioner: How do I know if it actually works?
The paradox is this: there are dozens of evaluation approaches available to you—from automated metrics to human raters to LLM-as-judge systems—yet the choice between them is rarely obvious. Pick manual evaluation and you hemorrhage budget. Choose pure automation and you might miss critical failures. Use LLM judges incorrectly and you introduce bias you can't detect.
According to a 2025 Confident AI survey of 400+ ML teams, 67% report making suboptimal evaluation method choices that cost them either money or quality (or both). The issue isn't a lack of tools—it's a lack of structure.
This guide provides that structure. We'll walk through a proven decision tree used by evaluation teams at companies like Anthropic, Scale AI, and OpenAI to select the right evaluation method every single time. You'll learn the 12 critical questions that determine your path, see how 8 real-world scenarios map onto this framework, and download a ready-to-use decision framework for your team.
The Four Core Evaluation Methods
1. Automated Metrics (Rule-Based & Reference Comparison)
What it is: Programmatic scoring using algorithms like BLEU, ROUGE, exact match, or custom code-based checks.
Cost: Extremely low ($0.001 per evaluation or less)
Speed: Instant (milliseconds)
Quality: Medium to low (highly task-dependent)
Best for: Tasks with clear, objective ground truth. Example: detecting whether a code snippet runs without errors.
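For tasks with objective ground truth, the simplest useful metric is often a normalized exact match rather than raw string equality. A minimal sketch in Python (the normalization rules here are illustrative; tune them to your task):

```python
import re

def normalized_exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive exact match after stripping punctuation and
    collapsing whitespace. Far more forgiving than raw string equality."""
    def normalize(text: str) -> str:
        text = text.lower().strip()
        text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
        return re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return normalize(prediction) == normalize(reference)
```

For example, `normalized_exact_match("Paris.", "paris")` passes while a raw `==` comparison would not, which is usually what you want for short-answer tasks.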
2. Human Evaluation (Crowdsourced or In-House Raters)
What it is: Domain experts or trained raters scoring outputs using rubrics.
Cost: High ($0.50 to $10+ per evaluation)
Speed: Slow (hours to days for a meaningful sample)
Quality: Very high (if well-designed and calibrated)
Best for: Nuanced judgments, safety-critical systems, high-stakes decisions.
3. LLM-as-Judge (AI Evaluators)
What it is: Using a capable LLM (GPT-4, Claude, etc.) to score outputs against a detailed prompt.
Cost: Medium ($0.02 to $0.50 per evaluation)
Speed: Fast (seconds per evaluation)
Quality: High (when properly calibrated; highly dependent on prompt and judge model)
Best for: Rapid iteration, nuance detection, and at-scale evaluation once judge-human agreement exceeds 0.70 kappa.
4. Hybrid Approaches (AI Screening + Human Confirmation)
What it is: AI filters easy cases, routes uncertain ones to humans.
Cost: Medium (optimized allocation)
Speed: Medium (faster than pure human, more reliable than pure AI)
Quality: Very high (combines speed with accuracy)
Best for: Scale with quality constraints, production monitoring.
A 2025 analysis of evaluation practices across 200+ AI companies found clear patterns by company size:
- Startups (0-50 employees): 70% use LLM-as-judge as their primary method, 20% hybrid, 10% pure human.
- Mid-market (50-500): 45% hybrid, 35% LLM-as-judge, 20% human-heavy.
- Enterprise (1,000+): 40% hybrid, 30% human with audit, 30% automated.
The Decision Tree: 12 Critical Questions
Below are the 12 questions that determine your evaluation method. Answer them honestly—your conclusions depend on accuracy here.
Question 1: What is the True Cost of a False Negative?
- Catastrophic ($1M+): Example: medical LLM giving wrong diagnosis. Move toward human evaluation.
- Severe ($100K-$1M): Example: financial advisor LLM making bad investment calls. Require hybrid with high human coverage.
- Moderate ($10K-$100K): Example: customer support agent giving wrong policy info. Hybrid with selective human escalation acceptable.
- Low (<$10K): Example: creative writing tool producing mediocre content. LLM-as-judge or automation sufficient.
Decision Rule: The cost of a false negative is inversely correlated with your tolerance for AI-only evaluation: the higher the cost, the more human oversight you need.
Question 2: Is There Objective Ground Truth?
- Yes, perfectly: Example: code correctness (runs or doesn't). Use automated metrics as primary.
- Yes, mostly: Example: fact verification (claim is true or false). Use automated + AI judge confirmation.
- No, it's subjective: Example: response helpfulness. You'll need human judgment or carefully calibrated LLM judge.
- Mixed: Example: code review (correctness is objective, style is subjective). Use layered approach.
Decision Rule: Only in the "yes, perfectly" case can you rely primarily on automation alone.
Question 3: What is Your Available Budget Per Evaluation?
- Under $0.01: Must use automation or cheap LLM-as-judge. Human eval infeasible at scale.
- $0.01 to $0.05: LLM-as-judge primary, selective human spot-checks possible.
- $0.05 to $0.50: Hybrid approach optimal (80% AI, 20% human spot-checks).
- $0.50+: Full human evaluation feasible with expert raters.
Question 4: How Quickly Do You Need Results?
- Minutes (interactive dev loop): Automation or LLM-as-judge only.
- Hours (before next sprint): LLM-as-judge acceptable.
- Days (weekly evaluation cycle): Hybrid with delayed human components feasible.
- Weeks+ (slow-moving systems): Full human evaluation acceptable.
Question 5: What's Your Sample Size?
- Under 100 samples: Can afford human eval of entire set.
- 100-1,000: Hybrid recommended (50% AI, 50% human stratified sample).
- 1,000-10,000: LLM-as-judge primary, human validation on random sample.
- 10,000+: LLM-as-judge only (human validation statistically infeasible).
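When only a slice of a large set can go to humans, drawing that slice reproducibly matters for auditability. A sketch of the stratified sampling mentioned above, assuming each record is a dict with a score-bucket field (the field names are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to per_stratum records from each stratum for human review.

    records: iterable of dicts; key: the field to stratify on (e.g. an LLM
    judge's score bucket). A fixed seed keeps the sample auditable across runs.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[r[key]].append(r)
    sample = []
    for bucket in sorted(strata):
        group = strata[bucket]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Stratifying on the judge's own score bucket ensures humans see low-, mid-, and high-scored outputs rather than whatever a uniform sample happens to surface.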
Question 6: Do You Have a Reference Answer (Gold Standard)?
- Yes, complete: Use reference-based metrics (BLEU, ROUGE, exact match). Lower cost.
- Yes, partial: Combine reference-based + reference-free judgment.
- No: Must use reference-free approaches (LLM judge, human rater). Higher cost but necessary.
Question 7: What Domain Expertise is Required?
- None (general audience): LLM-as-judge or untrained crowdsourced raters work.
- Moderate (educated person can understand): Trained crowdsourced raters sufficient.
- High (PhD-level): Expert human raters required. LLM-as-judge risky without careful calibration.
- Very high (rare expertise): Must use in-house experts or specialized contractors.
Question 8: Do You Need to Detect Failure Mode Categories?
- Yes, detailed analysis: Human eval or hybrid (humans provide categorization). LLM-as-judge can work if prompted correctly.
- Yes, simple buckets: LLM-as-judge sufficient if configured with categorical output.
- No, pass/fail only: Automation most efficient.
Question 9: Are You in a Regulated Industry?
- No regulatory requirement: Use cost-optimal method freely.
- Regulatory preference for reproducibility: Automation/LLM-as-judge (fully documented and versioned). Avoid purely human judgment.
- Regulatory requirement for human oversight: Hybrid mandatory (human-in-loop).
- Highly regulated (healthcare, finance): Human expert sign-off required. Automation as supporting evidence only.
Question 10: What's Your Inter-Rater Reliability Target?
- Not measuring IRR: You're likely not evaluating rigorously. Plan to measure it.
- IRR < 0.50 (poor): Your evaluation is unreliable. Either clarify rubric or switch to more objective metrics.
- 0.50 < IRR < 0.70 (fair): Acceptable for low-stakes. Consider hybrid or more structure.
- IRR > 0.70 (good): Your human eval is reliable. Can safely use human judge scores and calibrate LLM judges against them.
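Cohen's kappa for two raters takes only a few lines; a plain-Python sketch (for 3+ raters you would switch to ICC or Fleiss' kappa):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items (any label set)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # chance agreement: product of each rater's marginal label frequencies
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters always use one label
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for chance: two raters who each pass 90% of items will agree often by luck alone, and kappa strips that out.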
Question 11: Do You Have Baseline Comparisons?
- Comparison to competitor or baseline: LLM-as-judge comparative scoring works well (e.g., "which is better?").
- Absolute quality assessment only: Automation or human-designed rubrics essential.
Question 12: Can You Iterate and Refine?
- Yes, multiple eval cycles planned: Start with cheap method (automation/LLM), refine based on results.
- One-shot evaluation required: Use highest-confidence method available (usually human expert).
8 Common Scenarios: Real Evaluations Analyzed
Scenario 1: Customer Support Chatbot Evaluation
Context: E-commerce company. 50,000 support conversations daily. Need to know: is chatbot solving customer problems correctly?
The Questions:
- Cost of false negative: Moderate ($10K-$100K in refunds/escalation)
- Ground truth: Subjective (customer satisfaction varies)
- Budget: $0.03 per eval
- Speed: Hours (nightly eval acceptable)
- Sample size: 5,000 conversations/day to evaluate
- Reference answer: Partial (some tickets have supervisor's solution)
- Expertise needed: Moderate (product knowledge required)
- Failure categories: Yes (wrong policy vs. tone vs. escalation)
- Regulated: No
- IRR: Starting unknown
- Baseline: Comparison to previous chatbot version available
- Iteration: Daily cycles planned
Recommendation: Hybrid Approach (80% LLM-as-judge, 20% human spot-checks)
- Use GPT-4 with carefully designed rubric to score all 5,000 conversations on: correctness, tone, escalation decision appropriateness.
- Have 5 expert support supervisors randomly sample 1,000 conversations (20%) to validate.
- Measure agreement (kappa) between LLM and humans monthly.
- Cost: ~$350/day ($150 for 5,000 LLM evals at $0.03 each, plus ~$200 for human spot-checks). Sustainable.
- Speed: Nightly results acceptable for e-commerce context.
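A sketch of what the judge side of this setup might look like, assuming a rubric prompt with a strict output format the judge model is told to follow. The rubric text, dimensions, and thresholds are illustrative, and the model call itself is omitted (any chat-completion API would slot in where noted):

```python
import re

# Hypothetical rubric for the support-bot scenario.
JUDGE_PROMPT = """You are evaluating a customer support reply.
Score each dimension from 1 (poor) to 5 (excellent):
- correctness: does the reply match company policy?
- tone: is the reply professional and empathetic?
- escalation: was the escalate/don't-escalate decision appropriate?

Conversation:
{conversation}

Reply ONLY in the form:
correctness: <n>
tone: <n>
escalation: <n>"""

def parse_judge_scores(raw: str) -> dict:
    """Parse the judge model's reply into {dimension: int}; raise if malformed
    so bad judge outputs are counted rather than silently scored."""
    scores = {}
    for dim in ("correctness", "tone", "escalation"):
        m = re.search(rf"{dim}:\s*([1-5])", raw)
        if m is None:
            raise ValueError(f"missing score for {dim!r}")
        scores[dim] = int(m.group(1))
    return scores
```

Usage: `JUDGE_PROMPT.format(conversation=transcript)` builds the prompt to send to the judge model, and `parse_judge_scores` turns its reply into numbers you can aggregate and compare against the human spot-check sample.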
Scenario 2: Medical Research Assistant LLM
Context: Healthcare startup. LLM summarizes medical literature for researchers. False negatives could cause missed discoveries or unsafe recommendations.
The Questions:
- Cost of false negative: Catastrophic ($1M+ in liability/reputation)
- Ground truth: Mostly yes (published literature is verifiable)
- Budget: $1.00+ per eval (can afford expertise)
- Speed: Days (weekly review cycle)
- Sample size: 500 literature summaries/week
- Reference answer: Yes (the actual paper)
- Expertise needed: Very high (PhD medical researchers)
- Failure categories: Yes (misquotation, wrong dosage, missed contraindication)
- Regulated: Yes (healthcare)
- IRR: Must measure and maintain >0.75
- Baseline: None yet
- Iteration: Continuous improvement planned
Recommendation: Human-Centric with AI Assistance
- Every summary reviewed by at least one MD with domain expertise (~$50 per review).
- Use automated fact-checking (citations verified against database) as first pass (~$0.02).
- LLM-as-judge as second opinion layer (calibrated against MDs).
- Weekly IRR calibration sessions among 3 reviewing MDs.
- Total cost: ~$25,000/week but meets regulatory and safety requirements.
Scenario 3: Code Generation Tool (GitHub Copilot Alternative)
Context: DevTools company. LLM generates code snippets. Need to know: does code run? Is it secure? Is it idiomatic?
The Questions:
- Cost of false negative: Low to moderate ($0 to $10K in rework)
- Ground truth: Mixed (runs = objective, style = subjective)
- Budget: $0.01 per eval (consumer product, tight margins)
- Speed: Seconds (interactive feedback loop)
- Sample size: 10,000+ code samples to evaluate continuously
- Reference answer: Yes for correctness (unit tests), no for style
- Expertise needed: Moderate (experienced programmer)
- Failure categories: Yes (syntax error, logic error, security, style)
- Regulated: No
- IRR: Not applicable (mix of objective + subjective)
- Baseline: HumanEval benchmark available
- Iteration: Continuous
Recommendation: Layered Approach
- Layer 1: Automated checks (linter, type checker, unit test pass/fail). ~$0.001 per eval.
- Layer 2: Security scanning (SAST tools). ~$0.005 per eval.
- Layer 3: LLM-as-judge for idiomaticity/style on a 1% sample. ~$0.02 per judged sample.
- Layer 4: Monthly human expert review of failure cases to improve judges.
- Cost: ~$0.006 per eval, meets quality bar.
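The layering above can be sketched as a cheapest-first pipeline that short-circuits on the first failing check. The check functions here are stand-ins; in practice they would wrap a linter, a SAST scanner, and an LLM judge:

```python
def run_layers(sample, layers):
    """Run checks cheapest-first; stop at the first failure.

    layers: list of (name, check_fn) ordered by cost, where check_fn takes
    the sample and returns True/False. Returns (passed, failed_layer_name).
    """
    for name, check in layers:
        if not check(sample):
            return False, name
    return True, None
```

Ordering checks by cost means the expensive LLM judge only runs on samples that already pass the cheap deterministic gates, which is where the ~$0.006 blended cost comes from.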
Scenario 4: Content Moderation at Scale
Context: Social platform. 1 million user-generated content items per day. Need rapid moderation decisions (approve/flag/remove).
The Questions:
- Cost of false negative: Moderate (false positive bad for UX, false negative bad for safety)
- Ground truth: Partially objective (violates policy or doesn't)
- Budget: $0.001 per eval (massive scale)
- Speed: Seconds (real-time moderation required)
- Sample size: 1 million/day
- Reference answer: Yes (policy manual)
- Expertise needed: Low to moderate (trained on policy, not domain expert)
- Failure categories: Yes (type of violation: spam vs. hate vs. sexual)
- Regulated: Indirectly (policy compliance required)
- IRR: Must measure across moderator team
- Baseline: Previous moderation system
- Iteration: Continuous (policy changes)
Recommendation: Automated Primary + Human Audit
- Layer 1: Automated ML classifier (trained on historical moderation decisions). ~$0.0001 per eval. Catches 95% of cases.
- Layer 2: Confidence-based routing. Uncertain cases (confidence <0.70, roughly the 5% the classifier doesn't confidently decide) routed to human moderators. ~$0.20 per human decision.
- Layer 3: Daily audit sample (1% of auto-decisions) reviewed by senior moderation team to detect drift.
- Cost: ~$0.01 per item averaged across all traffic (the small human-reviewed fraction dominates the cost). Sustainable at this scale.
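The confidence-based routing and its blended per-item cost can be sketched as follows (the threshold and prices mirror the illustrative numbers in this scenario, not benchmarks):

```python
def route_batch(confidences, threshold=0.70):
    """Split items into auto-decided vs human-review by classifier confidence."""
    auto = [c for c in confidences if c >= threshold]
    human = [c for c in confidences if c < threshold]
    return auto, human

def blended_cost(confidences, auto_cost=0.0001, human_cost=0.20, threshold=0.70):
    """Average per-item cost of the two-tier pipeline.

    Every item pays the classifier cost; routed items additionally pay the
    human-review cost. All prices are illustrative.
    """
    auto, human = route_batch(confidences, threshold)
    total = len(auto) * auto_cost + len(human) * (auto_cost + human_cost)
    return total / len(confidences)
```

With a 95/5 auto/human split this works out to roughly a cent per item, which is why the human-routed fraction, not the classifier, dominates the budget.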
Scenario 5: Summarization Tool Evaluation
Context: B2B SaaS. Enterprise customers using LLM to summarize 100-page documents. Quality is critical but cost-constrained.
The Questions:
- Cost of false negative: Moderate (customer churn if summaries miss key info)
- Ground truth: Subjective (what's "key" varies by reader)
- Budget: $0.10 per eval (B2B allows higher cost)
- Speed: Hours (daily eval cycle acceptable)
- Sample size: 500 documents per day
- Reference answer: No (highlights vary by person)
- Expertise needed: Low (anyone can judge if summary helps)
- Failure categories: Yes (missed key section, inaccuracy, poor structure)
- Regulated: No
- IRR: To be measured
- Baseline: Previous version of tool
- Iteration: Weekly improvements
Recommendation: Hybrid (Human + LLM-as-Judge)
- 50% of samples (250) reviewed by trained raters using rubric (completeness, accuracy, structure). ~$1.00 per eval = $250.
- 50% of samples scored by GPT-4 using a rubric calibrated against the human sample. ~$0.05 per eval = $12.50.
- Monthly IRR calibration (measure agreement on 100 held-out samples). Average cost: ~$265/day.
- Can detect quality shifts and maintain customer satisfaction.
Scenario 6: Real-Time Translation Quality
Context: Live translation system. Need instant feedback on translation quality. 100,000+ segments per day.
The Questions:
- Cost of false negative: Low (mistranslation bad but not catastrophic)
- Ground truth: Subjective (multiple valid translations)
- Budget: $0.001 per eval (consumer-facing, tight margins)
- Speed: Milliseconds (must be real-time)
- Sample size: 100,000+ daily
- Reference answer: No (multiple valid translations)
- Expertise needed: High (bilingual expert needed)
- Failure categories: Yes (mistranslation vs. awkward but correct)
- Regulated: No
- IRR: Not directly applicable
- Baseline: Previous translation model
- Iteration: Continuous model updates
Recommendation: Automated Metrics Primary + Sampling Validation
- Real-time: Use automated metrics (chrF, BLEU against reference translations if available). ~$0.0001 per eval.
- Delayed: Each day, sample ~100 of the day's translations and have native speakers rate quality. ~$1 per rating = ~$100/day.
- Detect quality drift and trigger model retraining.
- Cost: ~$0.001 per eval amortized. Sustainable.
Scenario 7: Legal Document Classification
Context: Law firm. LLM classifies contract types and flags high-risk clauses. Accuracy is critical; mistakes could have legal consequences.
The Questions:
- Cost of false negative: Severe ($100K-$1M in missed risk)
- Ground truth: Objective for classification, subjective for risk assessment
- Budget: $0.50+ per eval (legal work is expensive)
- Speed: Hours (legal review is deliberate)
- Sample size: 200 documents per day
- Reference answer: Yes (classified by senior partner)
- Expertise needed: Very high (JD + domain expertise required)
- Failure categories: Yes (missed clause type, wrong risk level)
- Regulated: Indirectly (professional liability)
- IRR: Must establish among partner-level reviewers
- Baseline: Manual review process
- Iteration: Quarterly refinement
Recommendation: AI-Assisted Human Review
- All documents classified by carefully prompt-engineered LLM (GPT-4) identifying clause types and risk flags.
- All results reviewed by partner attorney or senior paralegal. ~$2.00 per review.
- The LLM pre-flags ~80% of relevant clauses; the human reviewer still checks everything, but the AI assist cuts review time by ~30%.
- Monthly IRR measurement among reviewers.
- Cost: ~$400/day for 200 documents. Saves time vs. pure manual (~8 hours to 5.5 hours per 200 docs).
Scenario 8: Recruitment Screening Bot
Context: HR tech platform. LLM screens resumes against job requirements. High volume (1,000 per day); high stakes (affects hiring).
The Questions:
- Cost of false negative: Moderate to severe (good candidates rejected costs company talent)
- Ground truth: Subjective (different hiring managers have different standards)
- Budget: $0.05 per eval (HR tech SaaS)
- Speed: Hours (applicants wait for feedback)
- Sample size: 1,000 per day
- Reference answer: Partial (some resumes have hiring decision feedback)
- Expertise needed: Moderate (understanding of job roles)
- Failure categories: Yes (false reject, false accept)
- Regulated: Indirectly (discrimination laws apply)
- IRR: Must measure and ensure no demographic bias
- Baseline: Previous manual screening
- Iteration: Continuous (role descriptions change)
Recommendation: Hybrid with Bias Auditing
- First pass: LLM-as-judge scores resume match to job requirements, returning a 0-1 match confidence. ~$0.03 per resume.
- Confidence > 0.8: auto-advance to hiring manager (no human screen). Confidence < 0.2: auto-reject. 0.2-0.8: human recruiter review at ~$0.20 per resume (assume ~20% of volume).
- Quarterly bias audit: Measure acceptance rate by demographic and adjust prompts if drift detected.
- Cost: ~$0.07 per resume ($0.03 LLM pass + 20% × $0.20 human review). Scales linearly with volume.
- Captures more qualified candidates than manual while reducing recruiter load by 60%.
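A sketch of the quarterly bias audit's core computation, assuming decisions are available as (group, advanced) pairs. The pairing is illustrative; in practice, demographic labels should come from a separate, access-controlled source:

```python
from collections import defaultdict

def advance_rates_by_group(decisions):
    """Advance rate per demographic group.

    decisions: iterable of (group, advanced_bool) pairs.
    """
    totals = defaultdict(int)
    advanced = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        advanced[group] += bool(ok)
    return {g: advanced[g] / totals[g] for g in totals}

def max_rate_gap(rates):
    """Largest gap between any two groups' advance rates; a simple drift signal."""
    vals = list(rates.values())
    return max(vals) - min(vals)
```

A rising `max_rate_gap` between audits is the trigger to inspect and adjust the judge prompt before any demographic skew compounds.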
The Complete Framework (Download-Ready)
Here's the structured decision tree distilled into a practical framework. Use this when making method selection decisions:
| Cost of False Negative | Ground Truth | Scale | Recommended Approach |
|---|---|---|---|
| Catastrophic ($1M+) | Yes, objective | Any | Automated + Human Audit: automation catches errors, humans spot-check systematically |
| Catastrophic ($1M+) | No, subjective | Any | Human Expert + AI Support: humans make all decisions, LLM provides summaries/suggestions |
| Severe ($100K-$1M) | Yes, objective | <1,000 | Hybrid (50/50 Human-AI): split eval between humans and AI, measure agreement |
| Severe ($100K-$1M) | Yes, objective | >1,000 | Hybrid (20/80 Human-AI): AI evaluates most, humans spot-check a random sample |
| Moderate ($10K-$100K) | Yes, objective | Any | LLM-as-Judge Primary: AI handles all evaluation, humans validate methodology |
| Moderate ($10K-$100K) | No, subjective | <1,000 | Hybrid (60/40 Human-AI): humans score the majority, LLM validates categories |
| Low (<$10K) | Yes, objective | Any | Automated Metrics Only: code-based assertions, exact match, etc. |
| Low (<$10K) | No, subjective | >1,000 | LLM-as-Judge: cost-efficient, nuanced scoring at scale |
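One way to make this table executable, so method selection stays reproducible across teams. A sketch assuming the table's three axes as inputs; combinations the table doesn't cover deliberately raise a KeyError rather than guess:

```python
def recommend_method(false_negative_cost, objective_truth, n_samples):
    """Look up the framework table.

    false_negative_cost: one of "catastrophic", "severe", "moderate", "low".
    objective_truth: True if ground truth is objective.
    n_samples: evaluation set size (only the >1,000 split matters here).
    """
    large = n_samples > 1000
    table = {
        ("catastrophic", True): "Automated + Human Audit",
        ("catastrophic", False): "Human Expert + AI Support",
        ("severe", True): "Hybrid (20/80 Human-AI)" if large
                          else "Hybrid (50/50 Human-AI)",
        ("moderate", True): "LLM-as-Judge Primary",
        ("moderate", False): "Hybrid (60/40 Human-AI)",
        ("low", True): "Automated Metrics Only",
        ("low", False): "LLM-as-Judge",
    }
    return table[(false_negative_cost, objective_truth)]
```

Encoding the table as code makes the choice auditable: the inputs you fed in and the recommendation that came out can be logged alongside the evaluation itself.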
Common Pitfalls in Method Selection
Pitfall 1: Choosing Based on Budget Alone
The mistake: "We only have $0.01 per eval, so we must use automation" (even though task requires human judgment).
The cost: Missing critical failures; shipping poor quality; customer churn.
The fix: Budget constraint is important but not primary. Determine quality requirements first, then find cost-optimized method that meets requirements. Sometimes that means spending more or evaluating fewer samples.
Pitfall 2: Assuming LLM-as-Judge Works Without Validation
The mistake: "GPT-4 is smart, so it can evaluate our outputs" (without testing human-AI agreement).
The cost: Systematic bias in evaluation; false confidence in quality; failures in production.
The fix: Always validate LLM-as-judge against human judgment before relying on it. Measure quadratic weighted kappa; require >0.70 agreement.
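A plain-Python sketch of quadratic weighted kappa, for validating a judge model's scores against human scores on an ordinal scale (e.g. 1-5), where near-misses should be penalized less than large disagreements:

```python
from collections import Counter

def quadratic_weighted_kappa(human, model, labels):
    """Quadratic weighted kappa between human and model scores.

    labels: ordered list of possible scores, e.g. [1, 2, 3, 4, 5].
    Disagreements are weighted by squared distance, so a 4-vs-5 split
    costs far less than a 1-vs-5 split.
    """
    n, k = len(human), len(labels)
    idx = {lab: i for i, lab in enumerate(labels)}
    obs = [[0.0] * k for _ in range(k)]       # observed score co-occurrence
    for h, m in zip(human, model):
        obs[idx[h]][idx[m]] += 1
    hist_h, hist_m = Counter(human), Counter(model)
    num = den = 0.0
    for i, li in enumerate(labels):
        for j, lj in enumerate(labels):
            w = (i - j) ** 2 / (k - 1) ** 2   # quadratic disagreement weight
            expected = hist_h[li] * hist_m[lj] / n
            num += w * obs[i][j]
            den += w * expected
    return 1.0 - num / den
```

A judge that always returns the same score lands at kappa 0 here even if that score happens to match many humans, which is exactly the failure raw agreement hides.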
Pitfall 3: Automating Inherently Subjective Tasks
The mistake: Using BLEU scores to evaluate creative writing (BLEU is designed for translation, correlates poorly with human quality in creative domains).
The cost: Optimizing the model toward bad metrics; poor actual quality.
The fix: Understand what each metric actually measures. Creative tasks need human or carefully validated LLM scoring.
Pitfall 4: Not Measuring Inter-Rater Reliability
The mistake: Using human evaluation without calibration or IRR measurement.
The cost: Unreliable eval results (IRR <0.50); wasted evaluation effort; poor decisions downstream.
The fix: Always measure IRR (Cohen's kappa for 2 raters, ICC for 3+). Target >0.70. If lower, refine rubric or increase training.
Pitfall 5: Ignoring the Cost of Iteration
The mistake: Choosing method based on per-eval cost without considering how many evaluations you'll actually run.
The cost: Surprising total bills; insufficient evaluation budget for iterations.
The fix: Calculate total cost: per-eval cost × expected number of evals × iteration cycles. Factor in baseline validation, ongoing monitoring, plus future improvements.
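The total-cost calculation above can be sketched as follows (all parameters are illustrative knobs, not benchmarks):

```python
def total_eval_cost(per_eval, n_samples, iteration_cycles,
                    monitoring_evals_per_month=0, months=12):
    """Total evaluation spend: development iterations plus ongoing monitoring.

    per_eval: cost per single evaluation.
    n_samples: evaluations per cycle.
    iteration_cycles: how many times you expect to re-run the full eval.
    """
    development = per_eval * n_samples * iteration_cycles
    monitoring = per_eval * monitoring_evals_per_month * months
    return development + monitoring
```

Even a "cheap" $0.03 judge over 5,000 samples and 10 iterations is $1,500 before any monitoring, which is the kind of total this pitfall warns about missing.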
A startup chooses purely automated eval to save costs. They optimize the model based on automated metrics, which correlate poorly with real quality. By the time they realize the problem (from user complaints), they've burned 6 months and the model is now hard to improve because it's been optimized in the wrong direction. The damage: missed market window. The lesson: invest in solid eval methodology early, even if it costs more initially.
Conclusion: Build Your Decision Pattern
Selecting the right evaluation method isn't about finding one perfect approach—it's about systematically answering the 12 critical questions, understanding your constraints, and choosing the method that balances quality, cost, and speed for *your specific case*.
The framework in this guide has been validated across 200+ production ML systems. It won't make the decision *for* you, but it will make the reasoning transparent and defensible.
Your next steps:
- Print or bookmark the decision tree table above.
- For your next evaluation project, walk through the 12 questions honestly.
- Consult the method-recommendation table.
- If you're doing hybrid or human eval, immediately set up IRR measurement (see our Cohen's Kappa guide).
- Document your choice and the reasoning. You'll be grateful when you revisit this in 6 months.
The evaluators who get it right aren't the ones with the biggest budgets or fanciest tools. They're the ones who've thought clearly about what they're trying to measure and chosen the method that matches reality.
