What is Cost Per Evaluation?
Cost Per Evaluation (CPE) is a foundational financial metric in AI quality assurance. It represents the total expenditure required to complete one evaluation unit, calculated as:

CPE = Total Evaluation Program Cost ÷ Number of Evaluations Completed
This simple formula masks tremendous complexity. "Total cost" includes everything from human rater hourly wages to cloud infrastructure to management overhead. "Evaluations" varies widely—does it count individual predictions, batches, or complete model assessments? Understanding CPE requires breaking the metric into its components and recognizing that naive calculation misleads.
For a team running 10,000 evaluations with a $50,000 budget, the naive CPE appears to be $5.00. But this obscures critical questions: What quality were those evaluations? How many required multiple raters? What was the agreement rate? How much rework was needed? Was infrastructure utilization efficient?
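As a baseline, the naive calculation is just a division. A minimal sketch in Python (the function name is illustrative):

```python
def cost_per_evaluation(total_cost: float, num_evals: int) -> float:
    """Naive CPE: total program spend divided by evaluation count."""
    return total_cost / num_evals

# The team above: $50,000 budget across 10,000 evaluations.
print(cost_per_evaluation(50_000, 10_000))  # 5.0
```

The rest of this chapter is about why this number, taken alone, misleads.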
The most useful CPE calculation acknowledges two things: first, that you must distinguish between cost (what you spend) and value (what you get), and second, that CPE varies dramatically depending on evaluation type, complexity, and domain specialization.
Breaking Down Eval Costs
Evaluation budgets consist of five major categories. Understanding each helps identify where waste occurs and where investment yields returns.
1. Human Annotator Labor Costs
Human evaluation remains the gold standard for complex tasks. Costs depend critically on three factors: hourly rate, task complexity (time per evaluation), and quality requirements (single vs. multiple raters).
Hourly rates vary enormously by geography and expertise:
- General crowdsourced raters (US platforms like Amazon Mechanical Turk): $8–$12/hour, yielding ~$0.25–$0.75 per simple evaluation
- Trained domain contractors (via platforms like Upwork, Outlier): $15–$35/hour, yielding ~$2–$8 per moderate evaluation
- In-house quality specialists: $45–$80/hour ($90K–$160K annually), yielding ~$5–$15 per evaluation depending on complexity
- Expert domain specialists (physicians, lawyers, PhD-level researchers): $75–$200/hour, yielding ~$15–$100+ per specialized evaluation
- International contractor pools (India, Philippines, Eastern Europe): $3–$8/hour, yielding ~$0.10–$0.40 per simple evaluation (but with quality variance and timezone challenges)
Task complexity dramatically affects labor cost. A straightforward binary classification (thumbs up/thumbs down) takes 30–60 seconds. A nuanced evaluation requiring expertise—assessing whether a legal document summary covers all material facts—requires 5–15 minutes. A medical evaluation requiring clinical judgment might demand 30 minutes or more.
Multiple raters increase cost linearly. Evaluating with three independent raters costs roughly 3x the single-rater cost (before you factor in disagreement resolution, which adds another 5–10%).
2. LLM Judge API Costs
Using LLM-as-Judge (an LLM evaluating another LLM's output) has transformed evaluation economics. Costs depend on model choice and prompt complexity:
- GPT-3.5-turbo: ~$0.001–$0.002 per evaluation (1K input tokens, 200 output tokens at $0.50/$1.50 per 1M tokens)
- GPT-4 or Claude 3 Opus: ~$0.01–$0.05 per evaluation depending on prompt length (10K input tokens at frontier pricing)
- Open-source models (Llama 2, Mistral) via local hosting: ~$0.00005–$0.0002 per evaluation (compute cost only, not API markup)
- Claude 3.5 Sonnet: ~$0.003–$0.008 per evaluation (sweet spot for quality/cost)
The hidden cost in LLM judgment is prompt engineering. A poorly designed prompt might require 20,000 tokens of examples and instructions per evaluation. A well-crafted prompt reduces this to 2,000 tokens, saving 10x on costs. Additionally, LLM judges often require validation against human gold labels—you might run 1,000 evals with LLM judges but validate against 200 human evaluations, adding cost.
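Token pricing makes these per-eval figures easy to reproduce. A small sketch using the illustrative GPT-3.5-turbo-class prices from the list above ($0.50 input / $1.50 output per 1M tokens):

```python
def llm_judge_cost(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Per-evaluation API cost from token counts and per-million-token prices."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)

# Prompt engineering dominates: same judgment, very different bills.
bloated = llm_judge_cost(20_000, 200, 0.50, 1.50)  # over-stuffed prompt
lean = llm_judge_cost(2_000, 200, 0.50, 1.50)      # well-crafted prompt
print(f"${bloated:.4f} vs ${lean:.4f}")  # $0.0103 vs $0.0013
```

At 100,000 evaluations, that prompt-engineering difference alone is roughly $900.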
3. Infrastructure and Tooling Costs
Platform costs scale differently depending on deployment model:
- Annotation platform subscriptions (Scale, Labelbox, Toloka): $0.01–$0.05 per evaluation or $500–$5,000/month flat
- Custom evaluation pipeline (if you build your own): $2,000–$15,000 one-time dev cost, then $500–$2,000/month infrastructure (compute, storage)
- Evaluation SaaS (Evals, Braintrust, LangSmith): $200–$1,000/month or $0.02–$0.10 per evaluation
- Cloud compute (if you run LLM judges yourself): $0.0001–$0.0005 per evaluation depending on model size and batch efficiency
4. Management and Quality Assurance Overhead
Evaluation programs require management:
- Rater calibration and training: 5–15 hours per rater (at $50/hour = $250–$750 per rater)
- Quality audit (checking rater accuracy): 10–20% of evaluation hours reworked
- Disagreement resolution (when multi-rater evals disagree): 5–10% overhead
- Program management (planning, vendor coordination, reporting): 10–15% of rater budget
5. Analysis and Insights Generation
Raw eval scores are worthless without analysis:
- Failure analysis (categorizing what broke): 15–25 hours per major eval cycle
- Segmentation analysis (breaking down by user type, domain, etc.): 10–20 hours
- Trend analysis (comparing across model versions): 5–10 hours
- Reporting and visualization: 5–10 hours
For a typical evaluation program, analysis costs 15–25% of evaluation execution costs. Teams that skip this phase never derive actionable insights from their evaluations.
Human vs. Automated Eval Cost Comparison
The choice between human and automated evaluation is fundamentally about the cost-quality tradeoff. Here's how they compare across multiple dimensions:
| Dimension | Human Evaluation | LLM Judge | Hybrid Approach |
|---|---|---|---|
| Cost per Eval | $5–$50 | $0.001–$0.05 | $0.50–$5 |
| Latency | 24–72 hours | <1 second | 1–10 seconds |
| Consistency | Inter-rater agreement 60–85% | High at temperature 0, but not perfectly deterministic in practice | High consistency with human oversight |
| Nuance/Judgment Calls | Excellent—humans excel at judgment | Variable—depends on prompt design and judge quality | Good—LLM flags edge cases, human decides |
| Scalability | Limited by rater availability | Unlimited (can run millions instantly) | Can scale intelligently |
| Bias Risk | Demographic bias in raters (well-documented) | Model biases (often subtle, hard to detect) | Mitigated if human review is rigorous |
| Requires Validation? | Not always (humans are trusted) | Yes—must validate against human gold labels | Yes—spot-check LLM judgments |
The cost difference is stark: a single human evaluation at $25 costs what 8,000 LLM-judge evaluations cost. But quality varies dramatically. For tasks where human judgment is essential—evaluating fairness, detecting subtle biases, assessing clinical appropriateness—human evaluation is non-negotiable despite cost. For tasks where objective rules apply—"Does this output contain the word 'apple'?"—LLM judges or even automatic checks are more cost-effective.
Real Cost Calculation Example
Let's walk through a realistic scenario: evaluating a customer support AI model. You plan to run 1,000 evaluations with three independent raters (to ensure quality), then resolve disagreements.
Assumptions
- Task: Rate quality of AI-generated customer support responses (1–5 scale with explanation)
- Complexity: Medium (requires reading context and response, making judgment)
- Rater pool: Mix of in-house specialists ($60/hour) and trained contractors ($20/hour)
- Target: 1,000 evaluations with triple-rater coverage
Labor Cost Breakdown
Rater Hours Calculation:
- Each evaluation takes approximately 4 minutes average (reading + judgment)
- 1,000 evals × 3 raters = 3,000 rater-instances
- 3,000 instances × 4 minutes = 12,000 minutes = 200 hours of rater time
Cost Allocation:
- In-house specialists: 100 hours × $60/hour = $6,000
- Trained contractors: 100 hours × $20/hour = $2,000
- Subtotal: $8,000
Management and QA Overhead
- Rater training/calibration (16 hours across the rater pool at a $50 blended rate) = $800
- QA audits (10% of work redone) = 20 hours × $50 = $1,000
- Disagreement resolution (3-rater setup, ~15% disagreement) = 50 hours × $50 = $2,500
- Program management (coordination, reporting) = 15 hours × $80 = $1,200
- Subtotal: $5,500
Tooling and Infrastructure
- Annotation platform fee (Labelbox or similar, prorated) = $1,200
- Cloud storage and compute = $300
- Subtotal: $1,500
Analysis
- Failure categorization (15 hours) = $1,200
- Segmentation analysis (10 hours) = $800
- Reporting and visualization (8 hours) = $640
- Subtotal: $2,640
Total Program Cost
$8,000 + $5,500 + $1,500 + $2,640 = $17,640
Cost Per Evaluation
$17,640 ÷ 1,000 evals = $17.64/eval
But notice: if you only count the direct rater labor ($8,000 ÷ 1,000 = $8), you underestimate true cost by 55%. The "invisible" costs—management, QA, tooling, analysis—are critical to evaluation quality and insights.
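The arithmetic above condenses to a few lines, with the category subtotals as computed in this example:

```python
# Cost categories from the worked example above.
program = {
    "rater_labor":   8_000,  # direct annotation time
    "management_qa": 5_500,  # training, audits, disagreement resolution, PM
    "tooling":       1_500,  # platform fees, storage, compute
    "analysis":      2_640,  # failure categorization, segmentation, reporting
}

n_evals = 1_000
total = sum(program.values())
print(total)                                       # 17640
print(round(total / n_evals, 2))                   # 17.64 (true CPE)
print(round(program["rater_labor"] / n_evals, 2))  # 8.0 (labor-only, misleading)
```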
This $17.64 CPE is actually on the lower end for medium-complexity evaluations with quality oversight. Healthcare, legal, or financial domain evaluations with expert raters often run $50–$200/eval. Simple crowdsourced tasks run $0.50–$2/eval. The key driver isn't task complexity alone—it's the cost of reliable expertise.
Hidden Costs Teams Ignore
Even the detailed calculation above misses costs that accumulate quickly:
Rater Drift Detection and Retraining
Over a 6-month evaluation program, raters gradually drift from standards (their personal calibration changes). Detecting this requires re-evaluating a sample of past cases (~5% of evaluations). This adds 5% overhead that most teams ignore until quality degrades unexpectedly.
Rework Due to Specification Changes
Midway through evaluation, you realize your rubric was ambiguous or incomplete. This requires re-evaluating prior cases. Industry average: 10–15% rework. A 1,000-eval program effectively becomes 1,100–1,150 evals.
Attrition and Replacement
Trained raters leave. Replacing them requires retraining (5–8 hours at $50/hour = $250–$400 per replacement). If you have a 40% annual attrition rate and maintain a pool of 5 raters, that's 2 replacements/year × $350 = $700 annual overhead. Across a 100-person rater pool (large-scale programs), this becomes $14,000+/year.
Tool Implementation and Integration
One-time costs to set up platforms, integrate with your ML pipeline, and build custom workflows. Often $5,000–$20,000 that gets amortized across evaluations but is easily forgotten.
Compliance and Documentation
In regulated domains (healthcare, finance), evaluation documentation requires legal review, bias audits, and formal sign-off. This adds 20–50% overhead.
Iterative Refinement Cycles
Your first evaluation might reveal that your rubric doesn't measure what you thought. Running a "calibration round" of 50–100 evals to refine definitions is common but rarely budgeted in CPE calculations. If you amortize these across future evaluations, it's still real cost.
Teams quote "$5/eval" based on rater labor alone, then are shocked when actual program cost hits $15–$20/eval after hidden costs surface. Comprehensive CPE accounting prevents budget surprises and enables better prioritization decisions.
Cost Optimization Strategies
CPE is not fixed. Strategic choices can reduce cost by 40–60% without sacrificing quality if done thoughtfully.
Strategy 1: Tiered Evaluation Pyramid
Not all evaluations require human review. Structure evaluations in tiers:
- Tier 1 (Automatic checks, 100% of evals): $0.001/eval. Simple rule-based checks: "Does output contain SQL injection?" "Is response under 500 tokens?" ~50% fail at this tier automatically.
- Tier 2 (LLM judge, the ~50% that pass Tier 1): $0.01/eval. Medium-complexity evaluation with GPT-3.5-turbo. ~70% pass Tier 2; the remaining ~30% escalate.
- Tier 3 (Human review, the ~30% escalated from Tier 2): $20/eval. Only ambiguous or critical cases get human review.
Cost Math:
- 1,000 evals: 1,000 × $0.001 (Tier 1) + 500 × $0.01 (Tier 2) + 150 × $20 (Tier 3) = $1 + $5 + $3,000 = $3,006
- Average CPE: $3.01/eval (vs. $17.64 for all human)
You've reduced cost by 83% while maintaining quality on critical cases. This is the most impactful optimization.
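The tier funnel is easy to sanity-check in code, with volumes and unit costs taken from the tiers above:

```python
# (volume, cost per eval) for each tier: 1,000 enter the automatic checks,
# 500 pass to the LLM judge, 150 reach human review.
tiers = [
    (1_000, 0.001),  # Tier 1: automatic checks
    (500, 0.01),     # Tier 2: LLM judge
    (150, 20.0),     # Tier 3: human review
]

total = sum(volume * unit_cost for volume, unit_cost in tiers)
print(total)                    # 3006.0
print(round(total / 1_000, 2))  # 3.01 average CPE
```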
Strategy 2: Smart Sampling Instead of Full Coverage
Evaluate all outputs for your most critical metric, but sample for secondary metrics:
- All 10,000 outputs evaluated for "accuracy" (most critical)
- 1,000 outputs (10% sample) evaluated for "tone appropriateness"
- 500 outputs (5% sample) evaluated for "cultural sensitivity"
Statistical theory shows that a 10% random sample (n = 1,000) provides roughly ±3.1% margins of error at 95% confidence for most metrics, sufficient for decision-making. Relative to evaluating all three metrics on every output (30,000 evals), sampling cuts the workload to 11,500 evals, a roughly 62% reduction justified by statistical power analysis.
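The margin-of-error claim follows from the standard normal approximation for a proportion; a quick check at the conservative worst case p = 0.5:

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

print(round(margin_of_error(1_000), 3))  # 0.031 -> the +/-3.1% quoted above
print(round(margin_of_error(500), 3))    # 0.044 for the 5% sample
```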
Strategy 3: Batch Processing and Model Efficiency
If using LLM judges:
- Batch API requests together (cheaper per token than individual requests)
- Use cheaper models with rigorous validation (a validated GPT-3.5-turbo judge can beat GPT-4 on cost-effectiveness)
- Cache prompts and system instructions to reduce token counts
These optimizations can reduce LLM judge costs by 40–60% with little or no quality loss.
Strategy 4: Crowd Redundancy Reduction
Instead of 3-rater coverage on all items, use adaptive allocation:
- High-agreement items (both raters agree immediately): 2 raters, no third
- Moderate-agreement items: 3 raters, majority vote
- Disagreement clusters: 5 raters (true consensus-seeking)
Average cost per item: 2.5 raters vs. 3 raters = 17% reduction. Requires adaptive platform capability but pays off at scale.
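A sketch of the expected-rater arithmetic. The 70/20/10 agreement mix here is an assumption chosen to illustrate the ~2.5-rater average, not a figure from the text:

```python
# Hypothetical agreement mix (assumption): 70% of items resolve with
# 2 raters, 20% need 3, and 10% need 5 for true consensus.
mix = [(0.70, 2), (0.20, 3), (0.10, 5)]

avg_raters = sum(share * raters for share, raters in mix)
print(avg_raters)                    # 2.5
print(round(1 - avg_raters / 3, 2))  # 0.17 -> ~17% savings vs. flat 3-rater coverage
```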
Strategy 5: Cross-Validation and Gold Label Reduction
Gold labels (human-validated ground truth) are expensive to create but necessary for validation. Instead of creating 10,000 gold labels:
- Create 1,000 gold labels (high-quality, triple-rated, resolved)
- Use cross-validation: shuffle your evaluation set, have different annotators rate subsets, use agreement as quality signal
- Use this 1,000-item gold set to validate automated judges (LLM, rule-based)
This reduces gold label creation costs by 80–90% while maintaining validation rigor.
ROI of Evaluation Spending
Evaluation is an investment. What's the return? This requires connecting eval spending to business outcomes.
The $50K Eval That Prevented a $2M Disaster
A financial services company developed an AI model for loan approval recommendations. Preliminary internal evals showed 94% accuracy. Before broad deployment, they invested $50,000 in rigorous evaluation:
- Domain expert review (financial analysts): $20,000
- Fairness/bias analysis with demographic segmentation: $15,000
- Edge case stress-testing: $10,000
- Regulatory compliance documentation: $5,000
This evaluation discovered that the model had disparate impact on loan applicants based on zip code (a recognized proxy for race under fair-lending law). While overall accuracy was 94%, accuracy for applicants in predominantly minority zip codes was 67%. The model would systematically disadvantage protected groups.
Outcome without evaluation: Deploy, discriminatory loan denials occur, lawsuits filed, regulatory investigation, reputational damage. Estimated cost: $2M+ in settlements, fines, and remediation.
Outcome with evaluation: Issue discovered pre-deployment, model retrained with balanced data, bias testing added to standard process. Cost: $50K investment that prevented $2M loss = 40x ROI.
The $0 Evaluation That Cost $10M
A consumer product company skipped evaluation before deploying a customer support chatbot to production (after basic internal testing). Three weeks later:
- Chatbot hallucinated product features that don't exist (15% of conversations)
- Chatbot gave wrong refund policy information (causing refund disputes)
- Customer frustration spike, social media backlash, brand damage
Crisis management, customer service escalation, brand recovery campaigns, and lost customer lifetime value totaled ~$10M.
A $30,000 evaluation program (1,000 evals with subject matter experts) would have caught these issues. $30K cost prevented $10M loss = 333x ROI.
Evaluating Your Evaluation ROI
Calculate this way:
Evaluation ROI = (Cost of Prevented Failure - Evaluation Cost) / Evaluation Cost
For high-stakes domains (healthcare, finance, legal), failure costs are enormous. Even a 1-2% reduction in deployment failures justifies large evaluation budgets.
For low-stakes domains (entertainment recommendation, casual chatbots), evaluation budgets should be smaller but still non-zero.
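The formula above, applied to the two case studies. Note that the formula gives net ROI, while the prose quotes the simple multiple (prevented loss divided by cost):

```python
def evaluation_roi(prevented_loss: float, eval_cost: float) -> float:
    """Net ROI per the formula above: (prevented loss - cost) / cost."""
    return (prevented_loss - eval_cost) / eval_cost

print(evaluation_roi(2_000_000, 50_000))             # 39.0 net (2M / 50K gives the 40x multiple)
print(round(evaluation_roi(10_000_000, 30_000), 1))  # 332.3 net (10M / 30K gives the 333x multiple)
```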
Budgeting Frameworks for Eval
How much should you spend on evaluation? Three frameworks provide guidance:
Framework 1: Percentage of ML Budget
Allocate evaluation as a percentage of your overall ML engineering budget:
- Startups (pre-product-market-fit): 1–2% of ML budget (assume most budget goes to development)
- Growth-stage companies: 3–5% of ML budget
- Mature products: 5–10% of ML budget (evaluation is continuous)
- Regulated industries: 10–20% of ML budget (compliance demands rigor)
Example: A team with $1M annual ML budget allocates 5% = $50K/year to evaluation. This funds roughly 2,500–3,000 moderate-quality evaluations annually or 500–800 high-quality expert evaluations.
Framework 2: Per-Model-Version Approach
Budget based on model release frequency:
- Minor update (prompt tuning): $2,000–$5,000 (quick validation)
- Standard release (new model, retrain): $15,000–$30,000 (comprehensive evaluation)
- Major release (fundamental architecture change): $50,000–$100,000+ (extensive testing, multiple domains)
If you release 2 major versions/year + 4 standard releases/year + 12 minor updates/year:
Total budget = (2 × $75K) + (4 × $22.5K) + (12 × $3.5K) = $150K + $90K + $42K = $282K/year
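The cadence math as code, using the midpoint budgets assumed above:

```python
# (releases per year, budget per release), midpoints of the ranges above.
cadence = {
    "major":    (2, 75_000),
    "standard": (4, 22_500),
    "minor":    (12, 3_500),
}

annual = sum(count * cost for count, cost in cadence.values())
print(annual)  # 282000
```

Plugging in your own release frequency and per-release budgets gives a defensible annual number to bring to planning.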
Framework 3: Outcome-Based Budgeting
Budget based on business impact of failure:
- High-stakes (healthcare, finance, legal decisions): Budget to detect failure rates of 5–10% with 90–95% confidence. This drives significant spend.
- Medium-stakes (customer support, content moderation): Budget for 10–15% failure detection. Moderate spend.
- Low-stakes (recommendations, creative generation): Budget for 20–30% failure detection. Minimal spend.
Spend increases with failure consequence severity, not with engineering difficulty.
Cost Benchmarks by Company Size
What do comparable companies spend? Here are realistic benchmarks:
Startup (Series A, <$5M ARR, <10 ML engineers)
- Monthly eval spend: $500–$2,000
- Annual spend: $6,000–$24,000
- Typical approach: Lightweight evaluation, heavy reliance on community feedback
- Eval team size: 0.5–1 FTE (shared role)
- Cost per eval: $2–$8 (mostly crowdsourced or cheap LLM judges)
Growth-Stage (Series B–C, $5M–$100M ARR, 20–50 ML engineers)
- Monthly eval spend: $2,000–$8,000
- Annual spend: $24,000–$96,000
- Typical approach: Mix of human + automated, platform investment starting
- Eval team size: 1–3 FTEs (dedicated role)
- Cost per eval: $5–$20 (trained annotators, LLM validation)
Mid-Market (Series D+, $100M–$1B ARR, 50–200 ML engineers)
- Monthly eval spend: $10,000–$50,000
- Annual spend: $120,000–$600,000
- Typical approach: Sophisticated tiered evaluation, domain expertise, continuous monitoring
- Eval team size: 3–10 FTEs (dedicated team, multiple specialties)
- Cost per eval: $10–$50 (expert-level evaluation)
Enterprise (>$1B ARR, 200+ ML engineers, regulatory scrutiny)
- Monthly eval spend: $50,000–$500,000+
- Annual spend: $600,000–$6,000,000+
- Typical approach: Comprehensive multi-domain evaluation, regulatory documentation, adversarial testing
- Eval team size: 10–50+ FTEs (specialized teams by domain)
- Cost per eval: $15–$150 (high-expertise evaluation)
These benchmarks reflect that larger organizations have higher absolute spend but often lower cost-per-eval (due to scale economies) and higher per-engineer allocation (evaluation becomes non-optional at scale).
When to Spend More vs. Less
Not all evaluations deserve equal budget. Here's how to prioritize:
Spend More Evaluation Budget When:
- Failure has high consequences: Healthcare (patient safety), finance (regulatory/fraud), legal (liability). Budget $50–$200/eval.
- Fairness/bias matters: Hiring, lending, criminal justice. Demographic analysis multiplies cost but is non-negotiable. Budget 20–50% premium.
- Model is deployed at scale: Small error rates become large absolute failures when serving millions. Budget scales with impact potential.
- Domain expertise is specialized: Domain experts (physicians, lawyers) cost more but catch domain-specific failures. Accept higher unit cost.
- Regulatory oversight exists: Healthcare, finance, government. Documentation and compliance add 20–50% overhead but are mandatory.
- Trust is differentiator: When customers or partners base decisions on your model's output (B2B use cases), evaluation ROI is huge.
Spend Less (But Don't Skip) When:
- Failure is reversible: Recommendation systems, content suggestions. Users simply ignore bad recommendations. Budget $2–$5/eval.
- Rapid iteration needed: Early-stage products, fast experimentation. Use cheap evals to validate hypotheses. Budget $1–$3/eval initially.
- Evaluation is automated-able: Objective tasks with clear ground truth (code generation, math problems). LLM judges or automatic checks suffice. Budget $0.01–$0.10/eval.
- Community feedback available: Products with large user bases naturally provide evaluation signal. Use this (with some verification). Budget $0/eval in-house + verification layer.
Decision Matrix
Use this framework:
| Risk of Failure | Automation Possible | Recommended Spend | Example Domain |
|---|---|---|---|
| Very High | Low | $50–$200/eval (expert-heavy) | Medical diagnosis, critical bug detection |
| High | Low | $15–$50/eval (mixed) | Financial decisions, legal analysis |
| High | High | $2–$10/eval (automated + spot-check) | Code generation, classification |
| Medium | Low | $5–$20/eval (human) | Customer support quality, content moderation |
| Medium | High | $0.50–$3/eval (automated) | Recommendation filtering, search ranking |
| Low | High | $0.01–$0.50/eval (automated only) | Content suggestions, ads ranking |
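One way to operationalize the matrix is a simple lookup. The keys and spend bands transcribe the table above; the function name is illustrative:

```python
# Recommended spend band ($/eval) keyed by (failure risk, automation feasibility),
# encoding the decision matrix above.
SPEND_MATRIX = {
    ("very_high", "low"): (50.0, 200.0),   # expert-heavy
    ("high", "low"):      (15.0, 50.0),    # mixed
    ("high", "high"):     (2.0, 10.0),     # automated + spot-check
    ("medium", "low"):    (5.0, 20.0),     # human
    ("medium", "high"):   (0.50, 3.0),     # automated
    ("low", "high"):      (0.01, 0.50),    # automated only
}

def recommended_spend(risk: str, automation: str) -> tuple[float, float]:
    """Return the (low, high) recommended spend band in dollars per eval."""
    return SPEND_MATRIX[(risk, automation)]

print(recommended_spend("high", "high"))  # (2.0, 10.0): code generation, classification
```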
Set evaluation budgets before model development, not after. This forces trade-off thinking: Will you sacrifice some model sophistication to fund better evaluation? Or build the fancier model and accept higher deployment risk? Making this decision upfront prevents underfunded evaluation programs.
