The RAG Evaluation Challenge

Retrieval-Augmented Generation (RAG) combines two distinct components: retrieval (finding relevant documents) and generation (synthesizing an answer from retrieved documents). Evaluating RAG requires assessing both components and their interaction. A system that retrieves perfect documents but generates a poor answer fails. Conversely, a system that retrieves mediocre documents but somehow generates a great answer is lucky, not good.

The challenge differs from evaluating pure generation (where there's no retrieval component) and pure retrieval (where you just care about ranking). RAG evaluation requires metrics for: retrieval quality, generation quality given retrieved context, and their interaction. Additionally, RAG systems should be grounded: responses should be supported by retrieved documents, not hallucinated.

Evaluating RAG at production scale requires automatable metrics that correlate with user satisfaction. Human evaluation is expensive; doing it for every RAG system change is impractical. Building a suite of automated metrics that reliably measure RAG quality is therefore essential.

Retrieval Metrics: NDCG, MRR, Precision@k

NDCG@k (Normalized Discounted Cumulative Gain): Measures ranking quality. Formula:

DCG@k = rel₁ + (rel₂/log₂(2)) + (rel₃/log₂(3)) + ... + (relₖ/log₂(k))
iDCG@k = ideal DCG (all results perfectly relevant)
NDCG@k = DCG@k / iDCG@k (normalized to 0-1)

Example: Query "benefits of exercise", retrieved 5 documents.
Perfect relevance: [1, 1, 1, 1, 1] (all relevant)
Actual: [1, 1, 0, 1, 0] (docs 3,5 not relevant)
DCG = 1 + 1/1 + 0/1.58 + 1/2 + 0/2.32 = 2.5
iDCG = 1 + 1/1 + 1/1.58 + 1/2 + 1/2.32 = 3.57
NDCG@5 = 2.5/3.57 = 0.70

Interpretation: 0.70 indicates decent ranking (top documents relevant, but some non-relevant docs mixed in). NDCG values >0.8 indicate strong retrieval. Compute separately for each query, then average across your test set.
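The worked example above can be reproduced in a few lines of Python. This is a minimal sketch (the function name ndcg_at_k is ours); note that, matching the text's definition of iDCG, it normalizes against a fully relevant ideal list of length k rather than a re-sorted copy of the retrieved relevances.

```python
from math import log2

def ndcg_at_k(relevances, k):
    """NDCG@k using the variant from the text: rank 1 is undiscounted,
    rank i >= 2 is discounted by log2(i)."""
    def dcg(rels):
        return sum(r if i == 0 else r / log2(i + 1)
                   for i, r in enumerate(rels[:k]))
    ideal = [1] * k  # iDCG: all k results perfectly relevant
    return dcg(relevances) / dcg(ideal)

# Worked example from the text: docs 3 and 5 are not relevant
print(round(ndcg_at_k([1, 1, 0, 1, 0], k=5), 2))  # 0.7
```

Compute this per query, then average over the test set as the text suggests.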

MRR (Mean Reciprocal Rank): For queries where one answer is "correct," how many documents do you have to read before finding it? MRR = average of 1/rank_of_first_correct. Example: if first correct result is rank 3, MRR contribution is 1/3 = 0.33. If first correct is rank 1, contribution is 1. High MRR (>0.8) means correct answer appears early.

Query 1: First correct result at rank 2 → 1/2 = 0.5
Query 2: First correct result at rank 1 → 1/1 = 1.0
Query 3: No correct result → 0
MRR = (0.5 + 1.0 + 0) / 3 = 0.5
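The three-query example maps directly to code. A minimal sketch (function name mrr is ours), representing "no correct result" as None:

```python
def mrr(first_correct_ranks):
    """first_correct_ranks: rank of the first correct result per query,
    or None when no correct result was retrieved (contributes 0)."""
    return sum(1 / r if r else 0 for r in first_correct_ranks) / len(first_correct_ranks)

print(mrr([2, 1, None]))  # 0.5
```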

Precision@k and Recall@k: Precision@k = fraction of top-k results that are relevant. Recall@k = fraction of all relevant documents that appear in top-k. Example: retrieve 10 documents, 4 are relevant to the query, 6 are irrelevant. Precision@10 = 4/10 = 0.40. If there are 8 relevant documents total in the corpus, Recall@10 = 4/8 = 0.50.

For RAG, focus on Precision (you only use top results; irrelevant results hurt). Recall matters less unless you need to find all relevant documents.
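Both metrics fall out of a single pass over the top-k list. A minimal sketch (function name precision_recall_at_k is ours) using the numbers from the example, with hypothetical doc ids:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """retrieved: ranked list of doc ids; relevant: set of all relevant doc ids."""
    top_k = retrieved[:k]
    hits = sum(1 for d in top_k if d in relevant)
    return hits / k, hits / len(relevant)

relevant = {f"rel{i}" for i in range(8)}            # 8 relevant docs in the corpus
retrieved = ["rel0", "rel1", "x1", "rel2", "x2",
             "rel3", "x3", "x4", "x5", "x6"]        # 4 of the top 10 are relevant
p, r = precision_recall_at_k(retrieved, relevant, k=10)
print(p, r)  # 0.4 0.5
```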

Context Relevance

Just because documents are retrieved doesn't mean they're useful for answering the question. Context Relevance measures whether retrieved context actually helps the LLM answer the question correctly.

LLM Judge for Context Relevance: Prompt an LLM: "Given this query and these retrieved documents, are the documents helpful for answering the query?" Score 1-5. Example:

Query: "What is the capital of France?"
Retrieved docs: [Wikipedia article on Paris, Article on Eiffel Tower, 
                 Article on French cuisine, Article on Roman Empire]
LLM assessment: Docs 1-2 highly relevant (4/5). Doc 3 mentions Paris but 
tangentially (2/5). Doc 4 not relevant (1/5).
Mean context relevance: (4+4+2+1)/4 = 2.75/5

Automation with Ground Truth: For factual questions with known answers, measure: does the retrieved context contain facts needed to answer correctly? Parse retrieved context, extract claims, check if claims support the correct answer. Percentage of questions where retrieved context supports correct answer = context relevance.
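The ground-truth check described above can be sketched as follows. This is a deliberately naive proxy (all names are ours, and the substring check stands in for real claim extraction plus NLI):

```python
def context_supports_answer(contexts, answer_facts):
    """Naive proxy: every required fact string appears somewhere in the
    retrieved context. Production systems would extract claims and run NLI."""
    joined = " ".join(contexts).lower()
    return all(fact.lower() in joined for fact in answer_facts)

def context_relevance_rate(samples):
    """samples: list of (retrieved_contexts, required_answer_facts) pairs.
    Returns the fraction of questions whose context supports the answer."""
    supported = sum(context_supports_answer(c, f) for c, f in samples)
    return supported / len(samples)

samples = [
    (["Paris is the capital of France."], ["capital of France"]),
    (["Article about the Eiffel Tower."], ["capital of France"]),
]
print(context_relevance_rate(samples))  # 0.5
```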

Answer Faithfulness

Is the generated answer grounded in the retrieved context, or does it hallucinate? Faithfulness measures the extent to which the answer stays faithful to what the documents actually say.

Atomic Claim Extraction: Break the generated answer into atomic claims (single facts that can be true/false independently). Example: answer "Paris is the capital of France and has 2.2M residents." Atomic claims: (1) "Paris is the capital of France" (2) "Paris has 2.2M residents." Then check: are these claims supported by retrieved context? Count supported vs. unsupported. Faithfulness = supported claims / total claims.

Example workflow:
Generated answer: "The Eiffel Tower was built in 1889 and is 330m tall."
Atomic claims:
  1. "Eiffel Tower built in 1889" → Check context: YES (Wikipedia says 1889)
  2. "Eiffel Tower is 330m tall" → Check context: NO (context says 324m)
Faithfulness = 1/2 = 0.5 (50% faithful)
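The claim-checking loop is straightforward once claims are extracted. A minimal sketch (function name faithfulness is ours) with a pluggable support check; the default substring match is a toy stand-in for an NLI model:

```python
def faithfulness(claims, context, supports=None):
    """claims: atomic claims extracted from the answer.
    supports(claim, context) -> bool decides whether a claim is backed;
    the default naive substring check stands in for an NLI model."""
    supports = supports or (lambda claim, ctx: claim.lower() in ctx.lower())
    backed = sum(supports(c, context) for c in claims)
    return backed / len(claims)

ctx = "The Eiffel Tower was built in 1889 and is 324m tall."
print(faithfulness(["built in 1889", "is 330m tall"], ctx))  # 0.5
```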

NLI-Based Faithfulness: Use Natural Language Inference (NLI) models. For each claim in the answer, ask: given the retrieved documents, does this claim follow logically (entailment), contradict them (contradiction), or remain uncertain (neutral)? Only entailment claims count as faithful. Models such as BART-large-MNLI can automate this. Faithfulness = entailment claims / total claims.

Answer Relevance

Even if the answer is faithful to the context, is it actually answering the user's question? Answer Relevance measures alignment between answer and query.

Query-Answer Similarity: Compute semantic similarity between query embedding and answer embedding. High similarity (>0.7) suggests the answer addresses the query. Low similarity suggests the answer is off-topic. Use sentence transformers or similar models. Caveat: similarity alone is imperfect—a faithful, helpful answer might use different language than the query, lowering similarity.
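The similarity computation reduces to cosine similarity between two embedding vectors. In practice the vectors would come from a sentence-transformer model; the sketch below (function name is ours) uses toy vectors so the arithmetic is visible:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```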

LLM Judge for Answer Relevance: Prompt: "Does this answer actually address the query? Score 1-5." LLM judges correlate well with human judgment on relevance. Cheaper and faster than human review.

The RAGAS Framework

RAGAS (Retrieval-Augmented Generation Assessment) is a popular framework combining multiple metrics into a comprehensive RAG evaluation system. It consists of four main metrics:

1. Faithfulness: Is the answer grounded in the retrieved context? (0-1 scale). Computed using atomic claim extraction + NLI. Target: >0.8.

2. Answer Relevancy: Does the answer address the user query? (0-1 scale). Uses LLM judge or semantic similarity. Target: >0.8.

3. Context Precision: Of the retrieved documents, what fraction is actually useful? Computed as (useful docs / total retrieved). Target: >0.7.

4. Context Recall: Of all documents that could have been useful, what fraction was retrieved? Computed as (retrieved useful docs / all useful docs in corpus). Target: >0.6 (harder to achieve; requires good recall).

How to Run RAGAS: Install the library (pip install ragas). Create evaluation dataset with: query, generated answer, retrieved context. Run RAGAS evaluator on dataset. Library handles metric computation; outputs scores for each metric and sample. Aggregate to get system-level metrics. Python example:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall,
)

# Your eval dataset: one row per query
eval_dataset = Dataset.from_dict({
    'question': [...],
    'answer': [...],
    'contexts': [...],       # list of retrieved passages per question
    'ground_truth': [...],   # reference answer (needed for context_recall)
})

results = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy,
             context_precision, context_recall],
)

print(results)  # Shows scores for all metrics

Groundedness vs. Helpfulness Tension

There's a fundamental tension in RAG: highly faithful answers are sometimes incomplete (grounded in context but missing nuance), while more helpful answers sometimes hallucinate (adding useful context not in documents).

Example: user asks "What are the side effects of medication X?" Retrieved context lists common side effects. A faithful RAG system would list only those. But if the context is limited, users might find the answer incomplete. A less faithful system that adds context from training data might be more helpful but less grounded.

Resolving the Tension: (1) Improve retrieval to get more complete context (better context = both faithful and helpful), (2) Adjust generation instructions ("only use information from provided documents" for high-stakes domains like healthcare; more freedom for less critical contexts), (3) Measure both metrics and accept the tradeoff (report both faithfulness and user satisfaction to stakeholders).

Multi-Hop RAG Metrics

Some questions require chaining multiple documents. "Which football team won the championship in the same year Technology Company X went public?" requires: finding when X went public (doc 1), then finding which team won that year (doc 2).

Multi-Hop Specific Metrics: Standard metrics don't capture multi-hop reasoning. Additional measures: (1) Number of documents actually used in reasoning (higher usually better—more sources = more robust), (2) Reasoning chain coherence (does the reasoning logically connect documents?), (3) Intermediate step correctness (if multi-hop answer requires step 1 → step 2 → final answer, evaluate each step).

Evaluation Protocol: For multi-hop datasets, decompose answers into steps. Example: "Company X went public in 2010 (step 1); Team Y won in 2010 (step 2); Therefore, Y won in X's IPO year (final)." Score each step's correctness. Multi-hop accuracy = (correct steps / total steps).
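The step-scoring protocol reduces to counting correct steps. A minimal sketch (function name multi_hop_accuracy is ours), scoring the three-step example above with the final step judged wrong:

```python
def multi_hop_accuracy(step_results):
    """step_results: one boolean per reasoning step, True when the step
    matches the gold decomposition. Returns correct steps / total steps."""
    return sum(step_results) / len(step_results)

# Step 1 correct, step 2 correct, final conclusion wrong
print(round(multi_hop_accuracy([True, True, False]), 2))  # 0.67
```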

Citation Quality Metrics

RAG systems should cite which documents support which claims. Citation accuracy is critical: a false citation is worse than no citation.

Citation Accuracy: For each claim in the generated answer, check: did the system cite a document that actually supports that claim? Count accurate citations vs. inaccurate citations. Citation accuracy = accurate / total citations.

Citation Coverage: What fraction of claims are cited? Some systems fail silently (make uncited claims). Coverage = cited claims / total claims. Target: >90% of claims cited.

Citation Hallucination: Some systems cite documents that don't exist or don't support the claim. This is particularly problematic in legal/medical domains. Measure: percentage of citations where the cited document actually contains the claimed information. This requires parsing documents and checking claims programmatically.
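Given per-claim verdicts (which in practice require parsing the cited documents), the two citation metrics are simple ratios. A minimal sketch with hypothetical record fields:

```python
def citation_metrics(claims):
    """claims: list of dicts such as
    {"cited": bool, "citation_supports_claim": bool or None (if uncited)}.
    Returns (citation accuracy, citation coverage)."""
    cited = [c for c in claims if c["cited"]]
    coverage = len(cited) / len(claims)
    accurate = sum(1 for c in cited if c["citation_supports_claim"])
    accuracy = accurate / len(cited) if cited else 0.0
    return accuracy, coverage

claims = [
    {"cited": True,  "citation_supports_claim": True},
    {"cited": True,  "citation_supports_claim": False},  # hallucinated citation
    {"cited": False, "citation_supports_claim": None},   # silent, uncited claim
]
print(citation_metrics(claims))  # accuracy 0.5, coverage ~0.67
```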

RAG Metric Selection Guide

Legal Document Search RAG: Prioritize: Context Precision (only want highly relevant documents—legal precedent must be correct), Answer Faithfulness (can't make up facts in legal contexts), Citation Quality (citations to specific statutes/cases are critical), Recall (need to find all relevant precedents). Metrics: NDCG, Faithfulness, Context Precision, Citation Accuracy. De-emphasize: general fluency, style.

Medical Knowledge RAG: Prioritize: Faithfulness (medical hallucinations are dangerous), Answer Relevance (must address patient concerns), Context Precision (only credible sources), Safety (flag potentially harmful advice). Metrics: Faithfulness, Answer Relevance, Context Precision, plus custom safety checks. Recall less important (might have docs, might not, but what matters is accuracy of what's returned).

Customer Support RAG: Prioritize: Answer Relevance (must address customer issue), Helpfulness (customers judge by whether issue resolved), First Contact Resolution (metric: did customer get answer without escalation), User Satisfaction. Metrics: Answer Relevance, CSAT, Escalation Rate. Faithfulness matters but slightly less critical than in medical/legal.

Internal Knowledge Base RAG: Prioritize: Relevance (employees want quick answers), Freshness (outdated docs are worse than no docs—track document age), Recall (employees want comprehensive answers). Metrics: Precision@10, Recall, Answer Relevance, plus freshness metrics (% of retrieved docs <6 months old).

(Example metric dashboard: 4 RAGAS metrics; Avg Faithfulness 0.82; Avg Relevance 0.75; Avg Context Recall 0.68; Citation Accuracy 94%; NDCG@10 0.79.)
Start with RAGAS

If you're building a RAG system, implement RAGAS first. It covers the essentials (faithfulness, relevance, context quality). Add domain-specific metrics afterward. RAGAS provides a solid foundation that scales across RAG types.

Avoid Hallucination in High-Stakes Domains

In medical, legal, or financial RAG, faithfulness is non-negotiable. Don't trade grounding for fluency. A stiff, technically accurate answer beats a smooth hallucination every time. Configure generation to heavily prioritize context fidelity.

Validate All Metrics on Your Domain

RAGAS is trained on general data. Your domain might have different characteristics. Always validate metrics on human-rated samples. If a metric scores high but users are unhappy, something's miscalibrated—investigate.

RAGAS Metrics: Formulas and Interpretation

Metric            | Formula / Computation                             | Interpretation                             | Target
Faithfulness      | Supported claims / total claims (via NLI)         | 0-1; higher = answer grounded in context   | >0.8
Answer Relevancy  | Query-answer semantic similarity or LLM judge     | 0-1; higher = answer addresses query       | >0.8
Context Precision | Useful docs / total retrieved docs                | 0-1; higher = no irrelevant docs retrieved | >0.7
Context Recall    | Retrieved useful docs / all useful docs in corpus | 0-1; higher = finds all relevant docs      | >0.6
NDCG@k            | DCG@k / ideal DCG (see text)                      | 0-1; higher = better ranking               | >0.75
Citation Accuracy | Accurate citations / total citations              | 0-1; higher = citations correct            | >0.95

RAG Metric Selection Guide by Use Case

Use Case         | Primary Metrics                                    | Secondary Metrics        | Gating Criteria
Legal Search     | Context Precision, Faithfulness, Citation Accuracy | Recall, NDCG             | Citation accuracy >95%, Faithfulness >0.85
Medical QA       | Faithfulness, Answer Relevance, Safety Checks      | Context Precision, CSAT  | Faithfulness >0.9, zero harmful advice
Customer Support | Answer Relevance, FCR, User Satisfaction           | Precision, Response Time | Relevance >0.75, FCR >80%
Internal Wiki    | Precision@10, Recall, Answer Relevance             | Freshness, NDCG          | Precision >0.75, <5% outdated docs

Key Takeaways

  • RAG Has Two Components: Retrieval and Generation. Evaluate both. Don't optimize one at the expense of the other.
  • Faithfulness is Critical: Hallucinated answers, even if fluent, are worse than grounded but stiff answers. Measure faithfulness as a gating criterion.
  • RAGAS Framework is a Good Start: Covers essentials (faithfulness, relevance, context quality). Implement it first, then add domain-specific metrics.
  • Different Domains Need Different Metrics: Legal RAG cares about precise citations. Medical RAG cares about safety. Customer support cares about resolution. Choose metrics matching your domain.
  • Validate Metrics on Your Data: Off-the-shelf metrics are trained on general data. Always validate on human-rated samples in your specific domain.
  • Balance Groundedness and Helpfulness: Improve retrieval to resolve this tension. Better context enables both faithful and helpful answers.

Implement RAG Evaluation Now

Start with RAGAS: install the library, build an evaluation dataset (50-100 queries with retrieved docs and answers), run metrics. Get baseline scores. Identify weak areas. Iterate on retrieval and generation to improve metrics.
