The Metric Selection Problem

Most teams default to familiar metrics. They use accuracy because they've always used accuracy. They measure BLEU score for translation because benchmark papers use BLEU. They track CSAT for support because the industry standard is CSAT. These default metrics often measure the wrong thing for your specific system and use case. Wrong metrics produce misleading signals about system performance. You optimize for the metric and degrade actual system quality.

Consider a chatbot that optimizes for response length because longer responses seem better to human raters. The system learns to pad answers with unnecessary detail, making users wait longer for information they could get quickly elsewhere. The metric (response length as a proxy for quality) diverged from the actual goal (user satisfaction). This happens constantly because metric selection is usually treated as a low-priority decision rather than the engineering problem it actually is.

The cost of wrong metrics compounds. You build your system optimizing for the metric. You make deployment decisions based on it. You compare systems using it. You train teams to understand it. Then months later you realize it didn't measure what mattered. You've built infrastructure around a misleading signal. The cost of switching metrics is high, so the wrong metric persists longer than it should.

Systematic metric selection prevents this. Rather than using default metrics, you use a framework to evaluate whether each potential metric is valid, actionable, legible, integrated, and durable. This requires more upfront work but saves massive amounts of rework. The VALID framework makes metric selection a first-class engineering decision, not an afterthought.

The VALID Framework

Valid: Does the metric measure what you care about? This is content validity. A metric is valid if experts agree it measures the construct you're targeting. Accuracy is valid for measuring whether a system produces correct factual output. It's less valid for measuring whether the output is useful to users in context. Validity doesn't mean the metric is comprehensive; it means the metric actually measures something real about the system's behavior.

Actionable: Can the system's score be improved? More precisely: can engineering take specific actions that improve the metric? A metric is actionable if you know what to do to improve it. "User happiness" is hard to act on. "Time to first response" is easy to act on (optimize latency). Actionable metrics guide engineering work. Non-actionable metrics confuse teams because they don't know how to improve the score.

Legible: Do stakeholders understand what the metric means? A metric is legible if you can explain it to a non-technical executive and they understand what result it represents. BLEU score is not very legible (people don't understand the computation). "Translation accuracy against expert references" is legible. Legibility matters because non-technical stakeholders make deployment decisions. If they don't understand your metric, they can't trust your results.

Integrated: Does the metric connect to outcomes you care about? A metric is integrated if improvements in the metric correlate with improvements in business or user outcomes. If your metric improves but customer satisfaction decreases, the metric is not integrated with what matters. Integration is empirically validated, not assumed. You need data showing the correlation.

Durable: Is the metric stable over time? A metric is durable if it works on data from different time periods and different use cases. A metric that performs perfectly on January data but fails on April data is not durable. Durable metrics generalize beyond the specific conditions where you developed them. This prevents surprises when you actually deploy the system.
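One way to make these five dimensions concrete is to score each candidate metric against them before committing. The sketch below (all names and scores are hypothetical, not a prescribed rubric) records a simple 0-2 assessment per dimension and ranks candidates:

```python
from dataclasses import dataclass

# Hypothetical sketch: score each candidate metric 0-2 on the five
# VALID dimensions, then compare candidates before committing to one.
@dataclass
class ValidAssessment:
    metric_name: str
    valid: int       # 0 = no, 1 = partial, 2 = yes
    actionable: int
    legible: int
    integrated: int
    durable: int

    def total(self) -> int:
        return (self.valid + self.actionable + self.legible
                + self.integrated + self.durable)

candidates = [
    ValidAssessment("bleu", valid=1, actionable=1, legible=0,
                    integrated=1, durable=2),
    ValidAssessment("expert_reference_accuracy", valid=2, actionable=2,
                    legible=2, integrated=1, durable=1),
]
best = max(candidates, key=ValidAssessment.total)
```

The point is not the arithmetic but the forcing function: writing down a score per dimension surfaces the metric's weak spots before you build infrastructure around it.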

At a glance: 5 VALID dimensions, 7 validity questions, 5 actionability questions, 3 metric stacks.

Checklist Part 1: Validity

When assessing validity, ask seven questions. First: Do domain experts agree this metric measures the right construct? Have three subject matter experts review your metric definition and independently assess whether it measures what you claim. If they disagree, your metric definition is ambiguous or the construct itself is poorly defined. Validity requires expert consensus that the metric is sensible.

Second: Does the system design make this metric relevant? A metric can be valid in general but mismatched to a particular system's purpose. If you're evaluating a summarization system, asking whether it produces grammatically perfect text is valid but less important than whether it preserves meaning. Asking whether it matches the exact length the user specified is relevant only if length specification was a requirement. Match the metric to the system's purpose.

Third: What is the metric NOT measuring that matters? This forces explicit acknowledgment that every metric is incomplete. Accuracy doesn't measure latency. Response length doesn't measure correctness. Identifying what you're not measuring with this metric helps you avoid false confidence. It also tells you what other metrics you need.

Fourth: Would you make the same deployment decision if you only had this metric? If not, the metric alone is insufficient and you likely need multiple dimensions. If you wouldn't deploy a system based solely on this metric, it's either not your most important metric or you need additional information. This question clarifies the metric's role in your decision process.

Fifth: Have you validated this metric against ground truth? For automated metrics especially, validate that the metric correlates with human judgment. A semantic similarity metric is only valid if human experts agree that higher similarity corresponds to better quality. Correlation with ground truth is the empirical test of validity.
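The correlation check can be done with a few lines of standard statistics. This sketch (illustrative data, hypothetical variable names) computes the Pearson correlation between an automated metric and human quality ratings:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between an automated metric and human scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative data: automated similarity scores vs. 1-5 human ratings.
auto_scores = [0.91, 0.42, 0.77, 0.15, 0.63]
human_scores = [5, 2, 4, 1, 3]
r = pearson(auto_scores, human_scores)
# High r supports validity; low or negative r means the metric
# disagrees with human judgment and needs rework.
```

For ordinal human ratings, a rank correlation (Spearman) is often the more defensible choice; the workflow is the same.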

Sixth: Does the metric have established validity in research literature? Check whether your metric appears in published evaluation work. If it does, cite that work. If it doesn't, you're using a novel metric, which is fine but means you need to establish validity yourself. Novel metrics can be valid but require extra validation work.

Seventh: Can the metric be gamed or misled without actually improving the system? If so, it has limited validity for your purpose. A metric that improves by surface-level changes without real improvement is not a reliable measure of the construct. This is especially true for automated metrics that can be exploited.

Checklist Part 2: Actionability

First: Can engineering identify specific changes that improve this metric? Can you point to model architecture changes, training data modifications, or prompting changes that move the metric? If you can't, the metric doesn't guide engineering. It's a retrospective measurement tool, not an optimization target.

Second: Is the metric sensitive enough to detect improvements engineering can make? Some metrics are so coarse that all realistic improvements fall within noise. A binary metric (good/bad) might not discriminate between engineering efforts well. Finer-grained metrics (1-5 scale or continuous) usually detect changes better. Match metric granularity to engineering changes.

Third: Does the metric provide direction for improvement? If accuracy dropped, what should engineering do? If the metric is accuracy, the direction is "improve accuracy" but that doesn't specify what changed or how to fix it. Better actionable metrics decompose into dimensions that point to root causes. "Accuracy on numerical reasoning decreased" guides work better than "accuracy decreased."
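Decomposed accuracy of this kind is easy to compute if evaluation records carry a dimension tag. A minimal sketch (tags and data are illustrative):

```python
from collections import defaultdict

def accuracy_by_dimension(records):
    """Break overall accuracy down by a tagged dimension (e.g. question
    type), so a drop points at a root cause, not a single global number."""
    totals, correct = defaultdict(int), defaultdict(int)
    for dim, is_correct in records:
        totals[dim] += 1
        correct[dim] += int(is_correct)
    return {dim: correct[dim] / totals[dim] for dim in totals}

# Illustrative records: (dimension tag, whether the answer was correct).
records = [
    ("numerical", False), ("numerical", False), ("numerical", True),
    ("lookup", True), ("lookup", True), ("lookup", True),
]
breakdown = accuracy_by_dimension(records)
# Here numerical accuracy is low while lookup is fine, which directs
# engineering effort at numerical reasoning specifically.
```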

Fourth: Can you forecast impact before deploying changes? Actionable metrics allow you to predict how changes will impact production. If you improve a metric in lab testing, you should be able to forecast whether it will improve in production. Non-actionable metrics don't provide this forecasting capability.

Fifth: Are there clear intervention paths when the metric declines? If a metric degrades, can you identify what caused it and how to fix it? This requires diagnostic capability. The metric should support diagnosis, not just measurement. Diagnostic metrics are highly actionable.

Checklist Part 3: Sensitivity

Sensitivity means the metric detects real differences. Does the metric discriminate between clearly better and worse systems? Test this empirically. Run a system with known quality and one with quality degraded intentionally. Can your metric detect the difference? If not, the metric lacks sensitivity for the changes you care about.

What's the minimum detectable effect size? This requires power analysis. If you want to detect a 2% improvement in performance, do you need 100 examples or 10,000? The metric's variance determines how much data you need. High-variance metrics require enormous sample sizes to detect small improvements. They're less useful for iterating on systems because you can't detect incremental progress.
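A rough version of that power analysis uses the standard two-sample formula for comparing means: n per group is approximately 2(z_alpha + z_beta)^2 * sigma^2 / delta^2. The sketch below assumes roughly normal metric scores at ~5% significance and ~80% power; treat it as a back-of-envelope estimate, not a substitute for a proper power calculation:

```python
from math import ceil

def samples_per_group(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """Approximate examples needed per system to detect a mean difference
    of `delta` in a metric with standard deviation `sigma`, at ~5%
    significance and ~80% power (two-sample comparison of means)."""
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# A high-variance metric (sigma=0.5) needs far more data than a
# low-variance one (sigma=0.1) to detect the same 2-point improvement.
noisy = samples_per_group(sigma=0.5, delta=0.02)
stable = samples_per_group(sigma=0.1, delta=0.02)
```

The 25x gap between the two estimates is the practical cost of metric variance: with the noisy metric, weekly iteration on small improvements is simply out of reach.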

Is the metric responsive to model changes? Run multiple model variants and check whether the metric differentiates them. If 10 different models all score similarly on your metric, it has low sensitivity. It's not capturing meaningful differences between systems. This is especially important for automated metrics, which can be insensitive to real quality differences.

Have you compared sensitivity to alternative metrics? Your metric might be less sensitive than alternatives. Compare your top candidates head-to-head. Measure how much sample size each would require to detect a 5% improvement. Prefer more sensitive metrics because they require less data and shorter evaluation cycles.

Checklist Part 4: Cost and Scale

What's the per-unit cost of computing this metric? Some metrics require human annotation at $0.50 per example. Others require API calls at $0.001. Some run offline on cached data at near-zero cost. Calculate total cost for your typical evaluation sizes. A metric costing $100K to evaluate on production data might need to be deprecated in favor of a cheaper alternative.

Can you compute this metric for 10K examples? For 1M examples? Some metrics scale linearly with examples. Others have fixed costs that don't scale. Automated metrics usually scale well. Human annotation doesn't. For operational metrics that need to run daily, scalability is critical. For quarterly evaluations, it's less important.

What's the latency for computing this metric? Do you need results in minutes, hours, or days? Real-time metrics (latency, availability) compute instantly. Human annotation metrics take weeks. Automated metrics vary. Match metric latency to your decision timeline. You can't iterate on a 3-week annotation process weekly.

Are there human vs. automated tradeoffs? Automated versions of a metric are cheaper but less accurate. Human versions are expensive but more valid. Sometimes a hybrid approach—automated screening with human spot-check—provides the best tradeoff. Explicitly consider all options rather than defaulting to human annotation.
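The tradeoff is easy to quantify once you have per-unit numbers. This sketch uses hypothetical costs ($0.50/example human, $0.001/example automated); substitute your own:

```python
def evaluation_cost(n_examples, per_unit_cost, human_fraction=0.0,
                    human_cost=0.50):
    """Total cost of scoring n examples with an automated metric, optionally
    spot-checking a fraction of them with human annotation (hybrid setup)."""
    automated = n_examples * per_unit_cost
    spot_check = n_examples * human_fraction * human_cost
    return automated + spot_check

n = 100_000
all_human = n * 0.50                                # pure human annotation
all_auto = evaluation_cost(n, per_unit_cost=0.001)  # pure automated
hybrid = evaluation_cost(n, per_unit_cost=0.001, human_fraction=0.02)
# The hybrid stays near the automated cost floor while retaining
# a human validity check on a 2% sample.
```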

Checklist Part 5: Known Failure Modes

Every metric has failure modes. When does this metric break or mislead? Document failure modes explicitly. Accuracy breaks down when classes are imbalanced. BLEU breaks down for paraphrases. Customer satisfaction breaks down for premium users who have higher expectations. Documenting failure modes prevents false confidence in the metric.

What types of inputs does the metric fail on? Some metrics work great for common cases and fail on edge cases. Some work for typical lengths and break for very long or very short inputs. Some work for English and break for code-mixed text. Identify input types where the metric is unreliable and either avoid those inputs in evaluation or use different metrics for them.

Is the metric adversarially resistant? Can someone game the system to artificially inflate the metric without actually improving performance? If so, the metric can't be used for optimization or incentives. Adversarial resistance is less important for retrospective measurement but critical for forward-looking optimization.

How sensitive is the metric to metric-specific noise? A metric might be robust to model changes but sensitive to evaluation setup changes. Different annotation guidelines or annotation orders might affect the metric. Document these sensitivities so you can account for them.

Matching Metrics to System Types

RAG Systems: Retrieval accuracy (is the retrieved context relevant?), ranking quality (is relevant context ranked highest?), answer accuracy (does the generated answer use the context correctly?), context coverage (does the context contain sufficient information to answer?). A typical RAG metric stack: retrieval precision@5, end-to-end answer accuracy, coverage F1. These measure different failure modes in the pipeline.
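The retrieval precision@5 component of such a stack is a few lines of code. A minimal sketch with illustrative document IDs:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# Illustrative: ranked retrieval output and the known-relevant set.
retrieved = ["d3", "d7", "d1", "d9", "d4", "d8"]
relevant = {"d1", "d3", "d4"}
p5 = precision_at_k(retrieved, relevant, k=5)
```

Note that precision@k ignores ordering within the top k; if ranking position matters, pair it with a rank-sensitive metric such as MRR or NDCG.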

Chatbots: Coherence (is the response topically consistent?), relevance (does it address the user query?), factuality (are facts accurate?), tone appropriateness (is the tone suitable for the context?), instruction following (does it follow system instructions?). Metric stack: relevance scores, hallucination rate, tone classifier, instruction adherence.

Classifiers: Accuracy, precision, recall, F1 (choose based on class balance and cost of errors), ROC-AUC for ranking quality, calibration metrics for confidence quality, specificity and sensitivity by class. Metric stack depends on your specific class-balance and error-cost situation.
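Of these, the calibration metrics are the least familiar. A common choice is expected calibration error (ECE): bin predictions by confidence and average the gap between confidence and observed accuracy. A minimal sketch with illustrative data:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average gap between predicted confidence and observed accuracy,
    weighted by how many predictions fall in each confidence bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        bin_acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - bin_acc)
    return ece

# A perfectly calibrated classifier would have ECE near 0.
confs = [0.95, 0.95, 0.65, 0.65]
right = [1, 1, 1, 0]
ece = expected_calibration_error(confs, right)
```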

Agents: Task completion rate, steps to completion, constraint violations, cost efficiency (did it minimize API calls?), safety compliance. Traditional accuracy doesn't apply because the agent might succeed through different paths. Focus on outcomes and efficiency.

Code Generation: Functional correctness (does code run without error?), test pass rate, code quality (style, readability), efficiency (does it meet latency requirements?). Don't rely solely on test pass rate; combine with efficiency and quality metrics.

The Metric Retirement Decision

Metrics accumulate. You add a new metric when an old one becomes insufficient, but you don't always retire the old one. Teams end up tracking 20 metrics from different eras, some of which are outdated or redundant. Retiring metrics is difficult because you lose historical continuity. But metrics that no longer serve decision-making should be deprecated.

When should you retire a metric? When the metric no longer informs decisions you make. When it's been superseded by better metrics that measure the same thing. When the failure modes have become too problematic. When you stop acting on changes in the metric. When the cost-to-value ratio has become too high.

How do you manage the transition? Announce a retirement date well in advance (6 months is reasonable). Provide the replacement metric. Run both metrics in parallel so teams can see the transition. Document why the metric was retired and what replaces it. Maintain historical data so you can still reference old results if needed.
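One lightweight way to operationalize the parallel-run period is a metric registry that carries a retirement date. This is a hypothetical sketch, not a prescribed tool; all names and dates are illustrative:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical registry entry: run old and new metrics in parallel
# until the announced retirement date, then drop the old one.
@dataclass
class MetricLifecycle:
    name: str
    replacement: Optional[str] = None
    retirement_date: Optional[date] = None

    def is_active(self, today: date) -> bool:
        return self.retirement_date is None or today < self.retirement_date

registry = [
    MetricLifecycle("bleu", replacement="expert_reference_accuracy",
                    retirement_date=date(2025, 6, 1)),
    MetricLifecycle("expert_reference_accuracy"),
]

def active_metrics(today: date) -> list:
    return [m.name for m in registry if m.is_active(today)]
# During the parallel period both metrics report; after the retirement
# date only the replacement does, while historical data stays archived.
```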

What about backward compatibility? If the metric was used in published reports or organizational standards, retirement breaks continuity. You might need to keep the metric for historical reasons even if you don't use it going forward. Document this explicitly so future analysts understand why the metric still appears in dashboards.

Building Your Metric Stack

A well-designed metric stack has three layers. Core metrics: 3 metrics that directly measure performance on your most important dimensions. These are the top-line numbers you report to stakeholders. For RAG, this might be retrieval precision, end-to-end accuracy, and coverage. These three together give a complete picture of whether the system works.

Dimension metrics: 2-3 metrics for each core metric that decompose it further. For accuracy, you might have accuracy on recent questions, accuracy on questions about new products, accuracy on questions from new users. These provide diagnostic information about where the system succeeds and fails.

Leading indicators: 1-2 metrics that change quickly and correlate with core metrics but don't directly measure them. For a chatbot, token generation quality might predict conversation success without measuring conversation success directly. Leading indicators let you iterate quickly and forecast whether changes will improve core metrics.

Start with your three core metrics. For each one, implement dimension-specific variants. Then add leading indicators that help you iterate. This structure keeps the metric portfolio manageable (7-9 total metrics) while providing diagnostic depth. You avoid the pathology of tracking 30 metrics but understanding none of them deeply.
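The three-layer structure can be written down as a simple data shape, which also makes the portfolio-size check mechanical. A sketch with hypothetical metric names drawn from the RAG example:

```python
from dataclasses import dataclass

# Hypothetical sketch of the three-layer stack described above.
@dataclass
class MetricStack:
    core: list                  # 3 top-line metrics
    dimensions: dict            # decompositions per core metric
    leading: list               # fast-moving proxy indicators

    def size(self) -> int:
        return (len(self.core)
                + sum(len(v) for v in self.dimensions.values())
                + len(self.leading))

rag_stack = MetricStack(
    core=["retrieval_precision_at_5", "end_to_end_accuracy", "coverage_f1"],
    dimensions={
        "retrieval_precision_at_5": ["by_query_type"],
        "end_to_end_accuracy": ["by_document_freshness"],
        "coverage_f1": ["by_query_complexity"],
    },
    leading=["retrieval_ranking_score", "context_length"],
)
# size() staying in the 7-9 range keeps the portfolio manageable.
```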

| System Type | Core Metrics (3) | Dimension Metrics | Leading Indicators |
| --- | --- | --- | --- |
| RAG | Retrieval accuracy, end-to-end accuracy, coverage | By query type, by document freshness, by query complexity | Retrieval ranking score, context length |
| Chatbot | Relevance, factuality, tone | By query intent, by conversation turn, by user type | Response quality score, hallucination probability |
| Classifier | Accuracy, precision, recall | By class, by confidence, by input type | Prediction confidence, decision boundary distance |
| Code Gen | Test pass rate, code efficiency, quality | By problem complexity, by language, by style | Syntax correctness, type checking score |

Quick Reference: VALID Checklist

  • Valid: Does it measure the right thing? Expert consensus required.
  • Actionable: Can engineering improve it? Must point to specific interventions.
  • Legible: Do stakeholders understand it? Should be explainable in one sentence.
  • Integrated: Does it correlate with outcomes? Validate empirically.
  • Durable: Does it work over time and contexts? Test on different time periods.
  • Cost-effective: Can you compute it at scale? Document per-unit and total costs.
  • Failure modes documented: When does it break? What are the limitations?
  • Appropriate granularity: Is it fine-grained enough to detect improvements?
  • Diagnostic value: When it changes, can you understand why?
  • Aligned with system type: Does it match how your system actually works?

The Metric Validity Crisis

In many evaluation programs, a large fraction of the metrics (often a third to a half) are invalid: they don't measure what they claim to measure. This happens because metric selection is often done hastily or inherited from academic benchmarks designed for different purposes. Running the VALID checklist on your current metrics reveals how many are actually invalid. The answer usually surprises teams.

Best Practice: Metric Documentation Template

For each metric, document: (1) Definition and formula, (2) Why this metric, (3) Validity evidence, (4) Actionability pathway, (5) Sample size required for 5% improvement detection, (6) Known failure modes, (7) Cost and latency, (8) Stakeholder interpretation guide. This forces you to think through all VALID dimensions before committing to the metric. Bad metrics become obvious during documentation.
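If you keep metric documentation in code alongside the evaluation harness, the template can be a typed record so that no field can be silently omitted. A hypothetical sketch; every field value below is a placeholder, not real measurement data:

```python
from dataclasses import dataclass

# Hypothetical template mirroring the eight documentation fields above.
@dataclass
class MetricDoc:
    definition: str
    rationale: str
    validity_evidence: str
    actionability_pathway: str
    sample_size_for_5pct: int
    failure_modes: list
    cost_and_latency: str
    stakeholder_guide: str

doc = MetricDoc(
    definition="fraction of answers matching expert references",
    rationale="directly measures the correctness users care about",
    validity_evidence="correlation vs. blinded expert ratings (placeholder)",
    actionability_pathway="retrieval tuning, prompt changes, data fixes",
    sample_size_for_5pct=1200,  # placeholder; from your power analysis
    failure_modes=["open-ended questions with many valid answers"],
    cost_and_latency="per-example cost and run latency go here",
    stakeholder_guide="one-sentence plain-language interpretation",
)
```

Because the dataclass constructor requires every field, an undocumented dimension fails loudly at authoring time rather than surfacing months later.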
