Why Metric Selection Is Architecture-Specific

Using the wrong metric for your system type produces confidently wrong answers. You'll report strong numbers while shipping a broken product. This happens because different AI architectures optimize for different objectives and have different failure modes.

Metric category errors occur when you apply metrics designed for one task type to a completely different system. Accuracy is wrong for generation (ambiguous correct answers). BLEU is wrong for code (syntax matters more than n-gram overlap). MRR is insufficient for recommendations (diversity matters, not just ranking). Each system type requires metrics that align with its actual function and constraints.

This article maps system types to appropriate metrics. For each type, we cover: what to measure, which metrics to use, common pitfalls, and worked examples.

Metric Selection Principle

The right metric directly measures what matters for the system in production. If ranking quality matters (recommendations, search), use ranking metrics (NDCG, MRR). If correctness matters (code, classification), use accuracy-based metrics. If end-to-end task completion matters (agents, dialogue), measure task success. Never use a metric because it's convenient; use it because it measures what you care about.

Classification Systems

Classification is the simplest architecture: input → fixed set of categories. Evaluation depends on the structure: binary, multiclass, imbalanced, etc.

Binary Classification

Accuracy: Percentage of correct predictions. Simplest metric but problematic for imbalanced data (if 99% of examples are negative, predicting "negative" for everything gives 99% accuracy while being useless).

Precision and Recall: Fundamental tradeoff. Precision = true positives / (true positives + false positives). Recall = true positives / (true positives + false negatives). High precision means few false alarms. High recall means catching all positives. Which matters depends on the application. Medical diagnosis: maximize recall (catch all sick patients, even if some false alarms). Spam filtering: maximize precision (few false positives; missing spam is acceptable).

F1 Score: Harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). Ranges 0–1. Single number balancing precision-recall. However, F1 assumes precision and recall are equally important. If they're not (medical diagnosis cares more about recall), use weighted harmonic mean or report both metrics separately.

Worked Example (Binary Classification):

Spam detector evaluated on 1,000 emails: 50 actual spam, 950 legitimate. The detector flags 100 emails as spam: 30 are actual spam (true positives) and 70 are legitimate (false positives). The remaining 20 spam emails slip through (false negatives), leaving 880 true negatives.


Accuracy = (30 + 880) / 1000 = 0.91 = 91%
Precision = 30 / (30 + 70) = 0.30 = 30%
Recall = 30 / (30 + 20) = 0.60 = 60%
F1 = 2 × (0.30 × 0.60) / (0.30 + 0.60) = 0.40

91% accuracy sounds great. But precision is only 30%—70% of flagged emails are false positives. Recall is 60%—40% of spam gets through. Which is worse? Depends on your application. Report all three metrics; don't hide behind accuracy.
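As a sanity check, all four formulas can be computed directly from the confusion-matrix counts the example implies (TP = 30, FP = 70, FN = 20, TN = 950 − 70 = 880):

```python
# Binary classification metrics from raw confusion-matrix counts.
def binary_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts from the spam example above.
acc, prec, rec, f1 = binary_metrics(tp=30, fp=70, fn=20, tn=880)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```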

AUC-ROC: Area Under the Receiver Operating Characteristic curve. AUC measures how well the model ranks positives above negatives. Unlike accuracy (which requires a decision threshold), AUC evaluates ranking quality across all thresholds. AUC=1 means perfect ranking. AUC=0.5 means random guessing. AUC is threshold-independent, which is useful when the decision threshold isn't determined yet, but it can mask problems. A model with AUC=0.85 might have poor precision or recall at your actual threshold. Always report AUC but also report precision/recall at your specific operating point.
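AUC's ranking interpretation can be computed directly: it is the fraction of (positive, negative) pairs in which the positive example scores higher, with ties counting half. A minimal sketch on made-up scores:

```python
# AUC-ROC as the probability that a random positive outranks a random negative.
def auc_roc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each positive/negative pair contributes 1 (correct order), 0.5 (tie), or 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.1]
print(auc_roc(scores, labels))  # 5/6 ≈ 0.833: one positive is outranked once
```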

PR-AUC: Precision-Recall AUC. Like AUC-ROC but using precision-recall instead of true-positive-rate/false-positive-rate. PR-AUC is more informative for imbalanced datasets (where negative class dominates). Use PR-AUC when the positive class is rare.

Multiclass Classification

When there are more than two categories, additional questions arise: should you average metrics across classes? How do you handle class imbalance?

Macro vs. Micro vs. Weighted F1:

Example: 3-class sentiment (positive, negative, neutral). Distribution: 500 positive, 300 negative, 200 neutral.


Positive F1: 0.90
Negative F1: 0.75
Neutral F1: 0.40

Macro F1 = (0.90 + 0.75 + 0.40) / 3 = 0.683
Weighted F1 = (0.90 × 0.5 + 0.75 × 0.3 + 0.40 × 0.2) = 0.755

Macro F1 (0.683) emphasizes poor neutral performance. Weighted F1 (0.755) reflects actual distribution—most examples are positive/negative. Choose based on whether you care equally about all classes or proportionally.
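The two averages above reduce to a few lines:

```python
# Macro vs. weighted averaging over the per-class F1 scores from the example.
per_class_f1 = {"positive": 0.90, "negative": 0.75, "neutral": 0.40}
support = {"positive": 500, "negative": 300, "neutral": 200}

# Macro: every class counts equally, regardless of size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted: each class counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(per_class_f1[c] * support[c] / total for c in per_class_f1)

print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f}")  # macro=0.683 weighted=0.755
```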

Calibration Metrics: Classification metrics measure accuracy but don't measure calibration—the alignment between predicted probability and actual likelihood. A model predicting "90% confidence" should be right 90% of the time on average, not 70% or 95%.

Expected Calibration Error (ECE): Bin predicted probabilities (0–10%, 10–20%, etc.). Within each bin, compare the mean predicted confidence to the empirical accuracy; ECE is the average absolute gap, weighted by bin size. ECE < 0.05 is well-calibrated. ECE > 0.15 indicates miscalibration.

Brier Score: Mean squared error between predicted probabilities and actual outcomes. Brier Score = (1/n) × Σ(predicted_prob - actual_outcome)². Ranges 0–1. Lower is better. Brier Score penalizes both incorrect predictions and overconfident predictions.
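A minimal sketch of both calibration metrics, on made-up (predicted probability, outcome) pairs:

```python
# Expected Calibration Error with equal-width bins, plus Brier score.
def ece(probs, outcomes, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    err = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)       # mean predicted confidence
        acc = sum(y for _, y in b) / len(b)        # empirical accuracy in bin
        err += (len(b) / len(probs)) * abs(conf - acc)  # size-weighted gap
    return err

def brier(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

probs, outcomes = [0.9, 0.9, 0.8, 0.3], [1, 0, 1, 0]
print(f"ECE={ece(probs, outcomes):.3f} Brier={brier(probs, outcomes):.4f}")
```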

Cost-Sensitive Metrics

When errors have different costs, use cost-sensitive metrics. Missing a cancer diagnosis (false negative) is worse than a false alarm (false positive). Falsely flagging a transaction as fraud is worse than missing one fraud case.

Cost-Weighted Accuracy: Assign costs to each error type. Report weighted error rate:


Weighted Cost = (FN_count × cost_FN + FP_count × cost_FP) / total_examples

For cancer diagnosis: cost_FN = 100 (missing cancer is severe), cost_FP = 1 (false alarm is minor). A model with 95% accuracy but high false negative rate is worse than a 90% accuracy model with lower false negatives.
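A sketch of the weighted-cost formula, using the cost values from the cancer example and hypothetical error counts for two models:

```python
# Cost-weighted error: a missed case (FN) is assumed 100x as costly as a
# false alarm (FP), per the cancer-diagnosis example.
def weighted_cost(fn, fp, total, cost_fn=100, cost_fp=1):
    return (fn * cost_fn + fp * cost_fp) / total

# Hypothetical: model A misses more cases; model B raises more false alarms.
print(weighted_cost(fn=40, fp=10, total=1000))  # 4.01
print(weighted_cost(fn=10, fp=90, total=1000))  # 1.09 — cheaper despite more FPs
```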

Language Generation

Generation systems produce variable-length sequences (translations, summaries, chat responses). Evaluation is fundamentally different from classification.

Why BLEU Doesn't Work

BLEU (Bilingual Evaluation Understudy) is widely used but deeply flawed. BLEU measures n-gram overlap with reference translations. It penalizes paraphrasing (correct translation with different wording), rewards repetition, and ignores semantic meaning.

BLEU Example:

Reference: "The cat sat on the mat."
Hypothesis 1: "The cat sat on the mat." (BLEU-4: 1.0, correct)
Hypothesis 2: "The cat was sitting on the mat." (correct paraphrase, but no 4-gram matches the reference, so unsmoothed BLEU-4 drops to 0)
Hypothesis 3: "The the the the the the cat mat." (gibberish, yet clipped unigram precision is still 0.5—low-order n-gram overlap rewards degenerate repetition)

Don't rely on BLEU. Use it only for comparison with older work.

Modern Generation Metrics

chrF++: Character n-gram F-score (the ++ variant also mixes in word unigrams and bigrams). More forgiving of spelling variations and morphology than word-level BLEU. chrF++ correlates better with human judgments than BLEU for many languages, especially morphologically rich languages.

BERTScore: Embeds hypothesis and reference using BERT and compares token embeddings by cosine similarity. Each hypothesis token is greedily matched to its most similar reference token (yielding precision), each reference token to its most similar hypothesis token (yielding recall), and the reported score is the F1 of the two. BERTScore captures semantic similarity better than BLEU. Higher is better; typical BERTScore for good translations: 0.85–0.95.

Advantages: Semantic, handles paraphrasing. Disadvantages: Depends on BERT's training, may be weak for low-resource languages, doesn't capture all aspects of quality (fluency, adequacy).
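The greedy-matching idea can be sketched with tiny made-up token vectors standing in for real BERT embeddings (a toy, not the actual bert-score package):

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# BERTScore-style greedy matching: each token pairs with its most similar
# counterpart; precision/recall are the mean matched similarities.
def greedy_f1(hyp_vecs, ref_vecs):
    precision = sum(max(cos(h, r) for r in ref_vecs) for h in hyp_vecs) / len(hyp_vecs)
    recall = sum(max(cos(r, h) for h in hyp_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref = [[1.0, 0.0], [0.0, 1.0]]   # made-up "reference token" vectors
hyp = [[0.9, 0.1], [0.0, 1.0]]   # made-up "hypothesis token" vectors
print(round(greedy_f1(hyp, ref), 3))  # close to 1.0 for near-identical vectors
```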

G-Eval (GPT-4 as Judge): Use an LLM (GPT-4) to evaluate generation quality. Prompt the LLM with criteria (fluency, accuracy, completeness) and have it rate outputs 1–5. G-Eval correlates well with human judgments but introduces cost and latency. Use G-Eval for high-stakes evaluation or when reference-based metrics are insufficient.

Example G-Eval prompt:


Rate this summary 1-5 on completeness (does it capture key points?):
Summary: [model output]
Source: [original text]
Rating: [1-5]
Reasoning: [explain]

Human Preference Evaluation: Gold standard. Show humans two outputs (reference vs. model, or model A vs. model B) and ask "which is better?" Pairwise preference is more natural than absolute scoring. Aggregate preferences as win percentage or Elo rating. Budget for 20+ judgments per comparison to stabilize results.

Instruction Following (IFEval): For instruction-tuned models, evaluate whether the model follows specific constraints. "Write in exactly 3 sentences." "Use the word 'unfortunately'." Score as binary (follows/doesn't follow) or on a scale. IFEval is orthogonal to quality metrics—the output can follow instructions but be low quality, or violate instructions but be high quality. Measure both.

RAG Systems

Retrieval-Augmented Generation combines retrieval (finding relevant documents) with generation (producing answers from retrieved context). Evaluation is two-stage: retrieval quality + generation quality on retrieved context.

Retrieval Metrics

NDCG@k (Normalized Discounted Cumulative Gain): Ranks documents by relevance. Computes discounted cumulative gain, the sum of each result's gain divided by a position discount (Σ gain_i / log2(i + 1), so top results weigh more), then normalizes by the DCG of the ideal ranking. Perfect ranking achieves NDCG@10 = 1.0. Typical strong retriever: NDCG@10 ≈ 0.8–0.9.

MRR (Mean Reciprocal Rank): Simpler: the reciprocal rank of the first relevant document (1 / rank), averaged over queries. If the first result is relevant, the reciprocal rank is 1; if the 5th result is the first relevant one, it is 0.2. MRR is intuitive but binary (relevant/not relevant), not graded.

Precision@k, Recall@k: Precision@10 = relevant_in_top_10 / 10. Recall@10 = relevant_in_top_10 / total_relevant. Precision answers "how many top results are relevant?" Recall answers "what fraction of all relevant documents are retrieved?" Trade-off: retrieving more results increases recall but decreases precision.
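The three ranking metrics can be sketched over a single query's result list of graded relevances (illustration data; real MRR averages the reciprocal rank over many queries):

```python
import math

# rels: graded relevance of results in retrieved order (0 = irrelevant).
def ndcg_at_k(rels, k):
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)                  # best possible ordering
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0

def reciprocal_rank(rels):
    for i, r in enumerate(rels):
        if r > 0:
            return 1.0 / (i + 1)   # rank of first relevant result
    return 0.0

def precision_at_k(rels, k):
    return sum(1 for r in rels[:k] if r > 0) / k

rels = [3, 0, 2, 0, 1]  # made-up top-5 relevance grades
print(ndcg_at_k(rels, 5), reciprocal_rank(rels), precision_at_k(rels, 5))
```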

Generation Metrics for RAG

Faithfulness: Does the generated answer stay grounded in the retrieved context, or does it hallucinate beyond the context? Measure using: (1) NLI-based checking (use entailment model to verify each claim), (2) QA-based checking (ask questions about the answer; check if answers are in context), (3) LLM-as-judge (prompt GPT-4: "Is this answer grounded in the context?").

Answer Relevance: Does the answer actually address the question? Use semantic similarity (embed question and answer, compute cosine similarity). Or use LLM-as-judge with explicit relevance rubric.

Context Precision: What fraction of retrieved documents are actually relevant to the question? Context Precision = (relevant_docs_in_context / total_docs_in_context). Good retrieval: > 0.8.

Context Recall: What fraction of all relevant documents are retrieved? Context Recall = (relevant_docs_retrieved / all_relevant_docs). Harder to measure because you need to know all relevant documents, often not feasible.

RAGAS Framework

RAGAS (Retrieval-Augmented Generation Assessment) combines these into a production-ready evaluation framework. It measures: Context Precision, Context Recall, Faithfulness, Answer Relevance. Each scored 0–1. RAGAS reports aggregate score and per-dimension breakdown. Use RAGAS to diagnose: "Is my RAG system failing at retrieval or generation?"

End-to-End Metrics

Beyond component metrics, measure task success: Did the RAG system answer the user's question? Measure as binary (answered/didn't answer) or on a scale. For production, measure user satisfaction: Do users find the answers helpful?

Conversational/Chat Systems

Dialogue systems have multiple goals simultaneously: answering questions, being coherent, maintaining persona, achieving task completion, engaging the user.

Task-Focused Metrics

Goal Completion Rate: Percentage of conversations where the system achieves the goal (customer service: resolve issue; task-oriented: successfully provide information). Binary metric: completed/incomplete. For partial completion, use a 0–1 scale.

First Contact Resolution (FCR): Percentage of issues resolved in a single turn (no follow-up needed). Related to goal completion but stricter—considers efficiency.

Quality Metrics

Coherence: Do responses logically follow from previous context? Measure using: (1) Human rating (1–5: incoherent to perfectly coherent), (2) LLM-as-judge ("Rate coherence"), (3) Semantic continuity (embed last context and response; high similarity = coherent).

Persona Consistency: If the chatbot has a defined persona (helpful assistant, customer service bot), are responses consistent with that persona? Measure using: (1) Human rating, (2) LLM-as-judge with persona rubric, (3) Self-consistency (measure whether the bot says consistent things about itself across turns).

Engagement Metrics: Does the user find the conversation engaging? Measure using: (1) User satisfaction surveys, (2) Conversation length (longer conversations = more engagement, if quality is acceptable), (3) Return rate (does user come back?), (4) Session length (how long does the user chat?).

Conversation-Level vs. Turn-Level

Some metrics apply per turn (does this response make sense?). Others apply per conversation (did the overall conversation accomplish the goal?). A model can have high per-turn quality but still fail to accomplish the overall task. Report turn-level quality metrics alongside conversation-level goal completion.

Autonomous Agents

Agents are systems that perceive an environment, plan actions, and iterate toward goals. Evaluation requires measuring task completion, efficiency, safety, and value alignment.

Task Success Rate

Percentage of tasks completed successfully. For complex tasks with intermediate goals, measure partial success on a 0–1 scale. Example: "Organize calendar and send meeting invite." If the agent updates the calendar but never sends the invite, award partial credit (e.g., 0.5) rather than marking outright failure.

Step Efficiency

How many steps did the agent take relative to the optimal path?


Step Efficiency = (optimal_steps / actual_steps)

Ranges 0–1. 1.0 = perfect efficiency. 0.5 = took 2x optimal steps.

Tracks whether the agent learns efficient strategies or wastes actions.

Tool Use Accuracy

Percentage of tool calls (API calls, function invocations) that are correct. Did the agent use the right tool with the right parameters?

Example: Agent needs to "send email to [email protected]". Correct tool call: `send_email(to='[email protected]', ...)`
Error: Wrong recipient, wrong tool, or malformed parameters.

Error Recovery Rate

When the agent makes an error (tool fails, API returns error), can it recover? Measure as: (recoverable_errors_that_recovered / total_errors). Good agents: 70%+ recovery rate. Poor agents: 30%–40%.

Safety Compliance

Does the agent avoid harmful actions? Measure: (1) Safety violations caught / total actions, (2) Precision: false alarm rate (how many safe actions are blocked?), (3) Recall: miss rate (how many unsafe actions slip through?).

Value Alignment Score

For long-horizon agents, does the agent optimize for the intended objective or does it pursue proxy objectives? This is harder to measure. Common approach: expert evaluation. Experts rate whether the agent's behavior aligns with intended values. 1–5 scale. Measure per-step and per-trajectory.

Code Generation

Code evaluation requires functionality (does it run?) and quality (is it maintainable, efficient, secure?).

Pass@k (Functional Correctness)

Sample k different code generations from the model, execute against test cases. Pass = all test cases pass. Probability of at least one passing: Pass@k = 1 − (1 − pass_rate)^k.

Example: Model produces 10 code samples. 2 pass all tests. Pass rate = 0.2. Pass@10 = 1 − 0.8^10 ≈ 0.893.

Pass@k accounts for stochasticity—even if individual pass rate is low, sampling multiple candidates increases the chance of getting a working solution. Report Pass@1, Pass@10, Pass@100 to show how sampling helps.
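Both the simple independence formula above and the unbiased estimator from the HumanEval paper (which corrects for drawing k of n finite samples) can be sketched as:

```python
from math import comb

# Simple formula: assumes each sample passes independently with pass_rate.
def pass_at_k_simple(pass_rate, k):
    return 1 - (1 - pass_rate) ** k

# Unbiased estimator (HumanEval): n samples generated, c of them pass.
def pass_at_k_unbiased(n, c, k):
    if n - c < k:
        return 1.0          # can't pick k samples without including a pass
    return 1 - comb(n - c, k) / comb(n, k)

print(pass_at_k_simple(0.2, 10))      # ≈ 0.893, matching the example above
print(pass_at_k_unbiased(10, 2, 1))   # 0.2: with 2/10 passing, Pass@1 = 0.2
```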

CodeBLEU

Like BLEU for code. Measures n-gram overlap with reference code. Better than BLEU for code because it weights keywords and structure higher. Still problematic (penalizes correct alternatives), but better than raw n-gram matching.

Semantic Correctness

Beyond syntax, does the code do what it should? Execute generated code on comprehensive test suites. Measure test pass rate, edge-case behavior, and robustness on adversarial inputs.

Security and Maintainability

Security Vulnerability Rate: Static analysis or security expert review. Common issues: SQL injection, buffer overflows, use-after-free. Measure as vulnerabilities per 100 lines of code. Good models: < 0.01 vulns/100 LOC. Poor models: > 0.1.

Maintainability Index: Composite metric based on cyclomatic complexity, lines of code, Halstead metrics. Ranges 0–100. Higher is more maintainable. Good code: 75+. Poor code: <30. Automated tools compute this; don't rely on it alone.

Documentation Quality: Does the code have comments/docstrings? Are they accurate? Measure as percentage of functions with docstrings and human rating of docstring quality.

Recommendation Systems

Recommendations are ranked lists. Evaluation focuses on ranking quality, diversity, novelty, and long-term engagement.

Ranking Metrics

Precision@k, Recall@k, NDCG@k: Same as retrieval. Precision@5 = relevant items in top 5 / 5. NDCG@5 emphasizes ranking order.

Coverage: What fraction of catalog items appear in recommendations across all users? Low coverage (recommending same 100 items to everyone) is bad. Coverage = unique_items_recommended / total_items. Ideal: high coverage (diverse recommendations).

Diversity and Novelty

Intra-List Diversity: Are recommended items diverse or are they all similar? Measure as average pairwise dissimilarity within recommendation lists. Dissimilarity can be content-based (different genres, categories) or collaborative (different user bases). Higher diversity = better if it doesn't hurt relevance.

Novelty: Are recommended items new to the user or are they obvious (items the user already knows about)? Measure as percentage of novel items (items not in user history) in recommendations. Novelty = novel_items / recommended_items. Novel items drive user engagement.

Serendipity: Are recommendations surprising but relevant? Hard to measure formally. Proxy: recommendations from low-popularity items that user eventually engages with. Serendipity = unexpected_but_liked_items / relevant_items.
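Coverage, novelty, and intra-list diversity can be sketched over made-up recommendation lists, with item genres standing in for content features:

```python
from itertools import combinations

# Fraction of the catalog that appears in anyone's recommendations.
def coverage(rec_lists, catalog_size):
    return len({i for recs in rec_lists for i in recs}) / catalog_size

# Fraction of recommended items absent from the user's history.
def novelty(recs, history):
    return sum(1 for i in recs if i not in history) / len(recs)

# Fraction of item pairs in one list that differ in genre (content-based
# dissimilarity; a real system would use richer features).
def intra_list_diversity(recs, genres):
    pairs = list(combinations(recs, 2))
    return sum(genres[a] != genres[b] for a, b in pairs) / len(pairs)

genres = {"a": "rock", "b": "rock", "c": "jazz", "d": "folk"}
print(coverage([["a", "b"], ["a", "c"]], catalog_size=4))  # 0.75
print(novelty(["a", "c", "d"], history={"a"}))             # ≈ 0.67
print(intra_list_diversity(["a", "b", "c"], genres))       # ≈ 0.67
```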

Long-Term Engagement vs. Short-Term CTR

Optimization for click-through rate (CTR) can hurt long-term engagement. Users might click on sensational headlines but not enjoy the content. Measure both: short-term CTR and long-term signals such as retention, return visits, and session satisfaction.

A good recommendation system balances CTR, diversity, novelty, and long-term engagement. Optimize for the right metric for your business.

Summarization

Summarization evaluation measures information coverage, accuracy, and conciseness.

Reference-Based Metrics

ROUGE-1, ROUGE-2, ROUGE-L: N-gram overlap with reference summaries. ROUGE-1 = unigram overlap, ROUGE-2 = bigram overlap, ROUGE-L = longest common subsequence. ROUGE ranges 0–1. Like BLEU, ROUGE penalizes paraphrasing but is still standard. Report ROUGE-1, ROUGE-2, ROUGE-L for completeness.
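ROUGE-1 F1 reduces to clipped unigram overlap; a simplified sketch (the real ROUGE toolkit adds stemming and other options):

```python
from collections import Counter

# ROUGE-1 F1: clipped unigram overlap between summary and reference.
def rouge1_f1(summary, reference):
    s = Counter(summary.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((s & r).values())   # per-word counts clipped to the minimum
    if overlap == 0:
        return 0.0
    precision = overlap / sum(s.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat", "the cat sat on the mat"))  # 2/3 ≈ 0.667
```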

BERTScore: Semantic similarity via embeddings. Captures paraphrase summarization better than ROUGE. Use BERTScore alongside ROUGE for comprehensive evaluation.

FactScore: Measures faithfulness to source. Breaks summary into atomic facts, checks if each fact is supported in source. FactScore = facts_supported / total_facts. Penalizes hallucination. Typical good summary: FactScore > 0.9.

Abstractive Evaluation

Coverage/Completeness: Does the summary capture key information? Human raters score 1–5. Summary should cover main points but not minor details. Balance coverage with conciseness—a summary that includes everything defeats the purpose.

Conciseness Ratio: summary_length / source_length. 0.1 means the summary is 10% of the source length (10x compression, very concise). 0.3 means 30% of the source length (moderate compression). Higher ratio = less compression. Useful for comparing summaries against a target compression rate.

Multi-Dimensional Human Evaluation

Dimensions: Relevance (does it contain important info?), Consistency (accurate or hallucinated?), Fluency (grammatically correct and natural?), Coherence (well-structured?). Score each 1–5. Compute average. This multi-dimensional rating reveals which aspects need improvement.

Search and Retrieval

Search systems return ranked results to user queries. Evaluation focuses on ranking quality and efficiency.

Sparse Retrieval Metrics (BM25, Lucene)

Metrics from information retrieval: NDCG, MRR, Precision@k, MAP (Mean Average Precision). Standard benchmarks: MS MARCO, Natural Questions, TREC.

Sparse retrieval baseline (BM25): Keyword matching. Good for exact queries, weak for semantic queries. Typical NDCG@10: 0.25–0.35 on hard queries.

Dense Retrieval Metrics (Embeddings)

Modern approach: embed query and documents in shared space, rank by similarity. Evaluate: (1) Ranking quality on benchmarks, (2) Embedding quality via downstream task performance, (3) Efficiency (latency, throughput).

Hybrid Search

Combine dense and sparse retrieval. Measure improvement over either alone. Hybrid approach typically achieves NDCG@10: 0.60–0.70 on hard benchmarks vs. 0.35 for sparse alone.

Production Evaluation

Benchmark metrics don't predict production performance. Measure real user behavior: Click-through rate (CTR), dwell time (how long users spend on returned results), conversion (user takes desired action), return rate (do they search again?), and satisfaction (surveys).

Typical reference points: NDCG@10 ≈ 0.85 for strong dense retrieval; mean BLEU-4 ≈ 0.32 on machine translation; Pass@1 ≈ 78% for strong code generation; F1 ≈ 92% for imbalanced classification.
System Type | Primary Metric | Secondary Metrics | Key Consideration
Classification | Accuracy, F1 | Precision, Recall, AUC-ROC | Account for class imbalance
Generation | BERTScore, G-Eval | chrF++, Human preference | Don't use BLEU alone
RAG | RAGAS score | Retrieval metrics, Faithfulness | Diagnose retrieval vs. generation failures
Dialogue | Goal completion | Coherence, Engagement, Task success | Multi-dimensional evaluation needed
Agents | Task success | Step efficiency, Safety compliance | Measure safety critically
Code | Pass@k | Security, Maintainability, Coverage | Correctness >> code style
Recommendations | NDCG, Coverage | Diversity, Novelty, Long-term engagement | Balance multiple objectives
Summarization | FactScore, ROUGE | BERTScore, Conciseness, Coherence | Prioritize faithfulness
Search | NDCG@10 | MRR, Precision@k, Real user metrics | Validate with production data

Metric Selection Framework

  • Identify system type: Classification, generation, ranking, task completion?
  • Understand the objective: What does "good" mean for this system? Maximize recall or precision? Ranking quality or diversity? Task success or user engagement?
  • Choose metrics aligned with objective: Don't default to accuracy, BLEU, or AUC. Choose metrics that directly measure what matters.
  • Use multiple metrics: No single metric captures all quality dimensions. Use primary metric + secondary metrics for comprehensive evaluation.
  • Validate with human evaluation: Automated metrics correlate with human judgment, but imperfectly. Sample outputs and get human feedback.
  • Stratify results: Report metrics broken down by category, difficulty, or user segment. Aggregate metrics hide important variation.
  • Test on representative data: Benchmark metrics assume specific data distributions. Always evaluate on data representative of your deployment scenario.

Select Metrics for Your System

Match your AI architecture to the right evaluation metrics. Use this guide to identify which metrics matter for your specific system type. Remember: the right metric directly measures what you care about in production.
