Introduction: The Diagnostic Challenge

When your RAG system produces a mediocre answer, the question is deceptively simple: Where did it fail? Was it the retriever that brought back the wrong documents? Did the generator hallucinate information? Or did it faithfully synthesize poor source material? Without decomposing these failure modes, you're optimizing blind.

Retrieval quality decomposition is the diagnostic discipline of isolating which component caused the failure. This matters enormously because the fixes are completely different: bad retrieval calls for better embeddings, reranking, or query expansion. Bad generation calls for prompt engineering, temperature tuning, or a better-calibrated base model. And some answers fail for reasons that no amount of model tweaking will fix—the source material simply isn't there.

This guide shows you how to systematically decompose RAG failures, measure component quality independently, and use that diagnostic information to prioritize improvements where they'll actually matter.

Core Premise
Retrieval and generation are separate mechanisms with separate failure modes. You cannot optimize a RAG system holistically; you must diagnose its failures, identify which layer failed, and apply targeted fixes.

The 3-Layer Diagnostic Framework

The decomposition framework divides RAG into three evaluation layers, each asking a different question:

Layer 1: Source Availability (The Data Question)

Is the information required to answer the question actually present in your retrieval corpus? This is a binary question: yes or no. If the answer is no, neither retrieval nor generation can save you—you need better data. This is measured with ideal recall: if you were to retrieve 100% of your corpus, could you find the needed information?

Layer 1 failures are not retriever failures. They're corpus failures. They require data engineering, not model tuning.
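A Layer 1 audit can be surprisingly mechanical. As a minimal sketch (the function, data, and naive substring matching are all illustrative; a real audit would use human review or semantic matching), scan the entire corpus for each required fact, bypassing the retriever entirely:

```python
def layer1_source_availability(required_facts, corpus):
    """Layer 1 audit: is each required fact present anywhere in the corpus?

    required_facts maps a question to the fact strings an answer needs.
    Matching here is naive case-insensitive substring search.
    """
    report = {}
    for question, facts in required_facts.items():
        missing = [f for f in facts
                   if not any(f.lower() in doc.lower() for doc in corpus)]
        report[question] = {"covered": not missing, "missing": missing}
    return report

corpus = ["T2DM guidelines updated in 2023 recommend metformin first-line."]
audit = layer1_source_availability(
    {"What is first-line for type 2 diabetes?": ["metformin"]}, corpus)
# "covered" is True here, so any end-to-end failure is not a corpus failure
```

If `covered` is False for a question, stop: no retriever or generator change will fix it.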

Layer 2: Retrieval Quality (The Retrieval Question)

Given that the information exists, does your retriever actually find it? This is measured by context precision and recall: what fraction of retrieved documents are relevant (precision), and what fraction of relevant documents does it retrieve (recall)? At this layer, you're evaluating whether the retriever is doing its job, holding the corpus constant.

Layer 2 failures indicate problems with embeddings, query expansion, or your reranking strategy. These are retriever problems.

Layer 3: Generation Quality (The Synthesis Question)

Given that relevant context is retrieved, does the generator synthesize it correctly and faithfully? This is measured by faithfulness (does the answer follow from the context?), relevance (does it answer the question?), and completeness (does it capture all the needed information?). At this layer, you're holding retrieval constant and evaluating whether generation works.

Layer 3 failures indicate problems with the base model, prompt design, or instruction following. These are generation problems.

Layer 1 (Source Availability) → Layer 2 (Retrieval Quality) → Layer 3 (Generation Quality)

The logic of decomposition is simple: if Layer 1 passes but Layer 2 fails, your problem is retrieval. If Layers 1 and 2 pass but Layer 3 fails, your problem is generation. This is not an academic distinction—it determines where you spend engineering effort.
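That triage logic can be written down directly. A sketch, with illustrative (not canonical) thresholds:

```python
def diagnose(layer1_pass, retrieval_f1, generation_score,
             retrieval_floor=0.65, generation_floor=0.8):
    """Map layer results to the component that needs work.

    Layers are checked in order: a failure at an earlier layer masks
    the later ones. Thresholds are illustrative defaults.
    """
    if not layer1_pass:
        return "corpus: add or improve source data"
    if retrieval_f1 < retrieval_floor:
        return "retrieval: embeddings, query expansion, or reranking"
    if generation_score < generation_floor:
        return "generation: prompt design or base model"
    return "healthy: no targeted fix indicated"

diagnose(True, 0.45, 0.9)  # -> "retrieval: embeddings, query expansion, or reranking"
```

The ordering matters: blaming generation while Layer 2 is failing wastes effort on the wrong component.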

Context Precision & Recall Measurement

Context precision and recall are the quantitative instruments for evaluating retrieval quality. Both require ground truth labels: which documents are relevant to the question?

Defining Relevance

Relevance is not binary. A document can be:

  • Directly relevant: Contains the specific answer to the question
  • Supportive: Provides context or background needed to understand the answer
  • Tangentially relevant: Related to the topic but not needed for this question
  • Irrelevant: Unrelated to the question

For decomposition purposes, score documents 0 (irrelevant) or 1 (relevant), where relevant means "directly or supportively relevant," i.e., necessary to answer the question. Tangentially relevant documents are marked 0 for evaluation.

Context Precision

Context precision answers: Of the documents you retrieved, what fraction were actually relevant? It's calculated as:

Precision = (Number of relevant docs retrieved) / (Total docs retrieved)

Precision is about avoiding noise. A retriever that fetches 10 documents, of which 3 are relevant, has 30% precision. High precision means your retriever is selective—it's bringing back documents it's confident about. Low precision means your retriever is noisy—it's bringing back a lot of junk alongside the signal.

In production RAG, you typically care more about precision than recall because:

  • LLM context windows are finite—you can't retrieve everything
  • Noise in the context degrades generation quality (the "lost in the middle" problem)
  • High-precision retrieval (3 really relevant docs) often beats high-recall retrieval (10 relevant docs plus 40 mediocre ones)

Context Recall

Context recall answers: Of the relevant documents that exist, what fraction did you actually retrieve? It's calculated as:

Recall = (Number of relevant docs retrieved) / (Total relevant docs in corpus)

Recall is about coverage. If there are 10 relevant documents in your corpus and you retrieve 3 of them, your recall is 30%. Recall measures whether your retriever has the sensitivity to find what it's looking for.
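Both formulas (plus the F1 used later in the isolation methodology) fit in a few lines. The document IDs below are hypothetical; the numbers mirror the worked examples above (10 retrieved with 3 relevant, 10 relevant in the corpus):

```python
def retrieval_metrics(retrieved_ids, relevant_ids):
    """Context precision, recall, and F1 from sets of document IDs."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# 10 docs retrieved, 3 of them relevant; 10 relevant docs exist in total:
m = retrieval_metrics(range(10), [0, 1, 2] + list(range(100, 107)))
# m["precision"] == 0.3 and m["recall"] == 0.3
```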

Recall is harder to calculate in practice because determining "total relevant docs in corpus" requires human annotation. For a 1M document corpus, you can't feasibly label every document. Solutions:

  • Sampled annotation: Randomly sample 100-200 documents from your corpus, label them, and estimate corpus-level recall
  • Pool-based annotation: Retrieve documents from multiple retrieval strategies, pool the results, and label. Recall is calculated against the labeled pool, not the full corpus
  • Oracle approach: For a development set, manually identify all relevant documents (treating your human judgment as ground truth)
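Pool-based annotation, for example, reduces to a de-duplicated union over several retrieval strategies. A sketch, where the retriever functions are stand-ins:

```python
def build_annotation_pool(question, retrievers, top_k=20):
    """Pool the top-k results from several retrieval strategies.

    Annotators label only the pooled documents; recall is then computed
    against labeled pool members, not the full corpus.
    """
    pool, seen = [], set()
    for retrieve in retrievers:
        for doc_id in retrieve(question, top_k):
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    return pool

bm25 = lambda q, k: ["d1", "d2", "d3"]   # stand-in sparse retriever
dense = lambda q, k: ["d2", "d4"]        # stand-in dense retriever
pool = build_annotation_pool("example question", [bm25, dense])
# pool == ["d1", "d2", "d3", "d4"]
```

The pool deliberately mixes strategies so the labeled set isn't biased toward any single retriever's blind spots.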

The Precision-Recall Tradeoff

Retrieval strategies typically face a tradeoff: you can have high precision (bring back only documents you're very confident about) or high recall (cast a wider net), but not both without more compute.

For RAG systems, the optimal operating point is usually high precision with moderate recall. Why? Because:

  1. Your context window is limited (typically 2K-4K tokens for retrieval)
  2. More documents don't always mean better answers—sometimes they hurt (competing information, noise)
  3. You can always retrieve more if needed (ask for clarification, multi-step retrieval)

A typical healthy benchmark: precision >70%, recall >60%.

Noise Sensitivity Testing

Beyond raw precision/recall, you should test how your generation degrades when retrieval is noisy. This is called noise sensitivity testing, and it reveals something precision measurements can't: is your generator robust to bad context?

The Noise Injection Protocol

Start with a clean test set where:

  • Questions are well-defined
  • Relevant documents are identified
  • Expected answers are specified

Then, systematically inject noise:

  1. Noise Level 0 (baseline): Retrieve only relevant documents. Measure faithfulness and answer quality.
  2. Noise Level 1 (light noise): For each relevant document, append 1-2 irrelevant documents. Re-measure.
  3. Noise Level 2 (moderate noise): For each relevant document, append 3-4 irrelevant documents. Re-measure.
  4. Noise Level 3 (heavy noise): For each relevant document, append 5-10 irrelevant documents. Re-measure.
  5. Noise Level 4 (extreme noise): Mix relevant and irrelevant documents 50-50. Re-measure.

Plot the results: quality score (y-axis) vs. noise level (x-axis). A robust generator should show graceful degradation. A brittle generator will cliff-drop.
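The injection steps above can be sketched as a context builder. This sketch uses the upper end of each level's noise range; `irrelevant_docs` would come from random corpus samples verified to be non-relevant:

```python
import random

# Irrelevant docs appended per relevant doc at levels 0-3 (upper end of
# each range in the protocol); level 4 is a 50-50 mix, handled separately.
NOISE_PER_RELEVANT = {0: 0, 1: 2, 2: 4, 3: 10}

def build_noisy_context(relevant_docs, irrelevant_docs, level, seed=0):
    """Return a shuffled retrieval context at the requested noise level."""
    rng = random.Random(seed)
    if level == 4:  # extreme: one irrelevant doc per relevant doc
        noise = rng.sample(irrelevant_docs, len(relevant_docs))
    else:
        wanted = NOISE_PER_RELEVANT[level] * len(relevant_docs)
        noise = rng.sample(irrelevant_docs, min(wanted, len(irrelevant_docs)))
    context = list(relevant_docs) + noise
    rng.shuffle(context)
    return context

ctx = build_noisy_context(["r1", "r2"], [f"n{i}" for i in range(30)], level=2)
# 2 relevant + 8 noise documents, shuffled
```

Running your generator over each level's contexts and scoring the answers produces the degradation curve described above.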

The Noise Reveal
If your generator's quality drops 40% when you double the noise, that's a red flag. It means your generation is fragile and will fail in production when retrieval occasionally misses.

Contradictory Context Testing

Noise isn't just irrelevant content—it can be contradictory content. This is harder to handle than noise because it looks relevant (it's about the topic) but says the wrong thing.

Test this by:

  1. Take test questions where multiple answers are plausible or where documents disagree
  2. Include both the correct and incorrect version of a fact in the context
  3. Measure what the generator produces (does it pick the right one? does it hedge?)

Example: A question about COVID-19 vaccination where one retrieved document is from March 2020 (before vaccines existed) and another is from 2022. Does the generator notice the temporal contradiction?
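A minimal harness for this test (the stub generator and marker strings are illustrative) interleaves the correct and incorrect documents in both orders and checks which claim the answer echoes:

```python
def contradiction_probe(generate, question, correct_doc, incorrect_doc,
                        correct_marker, incorrect_marker):
    """Feed contradictory docs in both orders; report which fact wins.

    generate(question, context) is your generator; the markers are
    strings that distinguish the two claims in the answer text.
    """
    outcomes = []
    for context in ([correct_doc, incorrect_doc], [incorrect_doc, correct_doc]):
        answer = generate(question, context)
        if correct_marker in answer and incorrect_marker not in answer:
            outcomes.append("correct")
        elif incorrect_marker in answer:
            outcomes.append("incorrect")
        else:
            outcomes.append("hedged")
    return outcomes

echo_first = lambda q, context: context[0]  # stub generator for illustration
outcomes = contradiction_probe(
    echo_first, "Are vaccines available?",
    "Vaccines are widely available.", "No vaccines exist yet.",
    "widely available", "No vaccines")
# an order-sensitive generator yields ["correct", "incorrect"]
```

A robust generator should be "correct" or "hedged" in both orders; flipping with document order is itself a failure signal.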

Component Isolation Methodology

The core of decomposition is isolation: evaluating each layer independently of the others. Here's the systematic approach:

Step 1: Human Annotation of Ideal Retrieval

For a development set of 50-100 questions:

  1. Write the question clearly
  2. Have an expert human (ideally 2-3 for consensus) identify all documents in your corpus that would be needed to answer the question
  3. Create a binary label: relevant (1) or not (0)
  4. Calculate inter-rater agreement (Cohen's kappa or Krippendorff's alpha)

This creates your Layer 1 ground truth: what documents should have been retrieved?
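For two annotators assigning binary relevance labels, Cohen's kappa is short enough to compute inline (libraries such as scikit-learn also provide it); a self-contained sketch:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters assigning binary (0/1) labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each rater's marginal rates
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # degenerate: both raters constant and identical
    return (observed - expected) / (1 - expected)

cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])  # -> 0.5
```

Kappa below roughly 0.6 suggests your relevance definition is ambiguous; tighten the annotation guidelines before trusting the ground truth.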

Step 2: Measure Actual Retrieval Against Ground Truth

Run your retriever on the same questions. Calculate:

  • Precision: of what you retrieved, what fraction matches ground truth?
  • Recall: of what should have been retrieved, what fraction did you get?
  • F1: harmonic mean of precision and recall

This gives you Layer 2 quality. Now you know: does your retriever work?

Step 3: Measure Generation with Perfect Retrieval

This is the key isolation step. Take the ground truth relevant documents and feed them to your generator. Don't use your actual retriever—use the ideal retrieval.

Measure generation quality:

  • Faithfulness: Does the answer follow from the context?
  • Relevance: Does it answer the question?
  • Completeness: Does it capture all key points?

This is Layer 3 in isolation. You now know: given perfect retrieval, how good is your generation?

Step 4: Measure Generation with Actual Retrieval

Now feed your actual retriever output to the generator. This is the end-to-end system test.

Measure the same metrics. The difference between Step 3 and Step 4 tells you: how much does retrieval quality degrade generation quality?

Interpretation

Now you have four data points:

Scenario                        | Retrieval Quality      | Generation Quality          | Interpretation
Perfect retrieval + generation  | N/A (ideal)            | Good (Step 3)               | Baseline: system works in isolation
Actual retrieval + generation   | Lower (Step 2)         | Lower (Step 4)              | System performance drops: why?
Quality drop analysis           | Retrieval gap = Step 2 | Gen gap = Step 3 vs. Step 4 | Attribute blame to layers

If generation quality with perfect retrieval is high, but drops significantly with actual retrieval, you have a retrieval problem. If generation quality is low even with perfect retrieval, you have a generation problem.
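The Step 3 vs. Step 4 comparison reduces to a subtraction. A sketch with illustrative thresholds, using the scores from the case study later in this guide:

```python
def attribute_quality_gap(score_perfect_retrieval, score_actual_retrieval,
                          generation_floor=0.8, gap_floor=0.1):
    """Attribute an end-to-end quality gap to retrieval and/or generation.

    score_perfect_retrieval: Step 3 (generator fed ground-truth docs)
    score_actual_retrieval:  Step 4 (generator fed real retriever output)
    """
    findings = []
    if score_perfect_retrieval < generation_floor:
        findings.append("generation problem: weak even with ideal context")
    if score_perfect_retrieval - score_actual_retrieval > gap_floor:
        findings.append("retrieval problem: quality drops with real retrieval")
    return findings or ["no significant gap attributed"]

attribute_quality_gap(0.89, 0.62)  # -> retrieval problem only
```

Note both findings can fire at once: a weak generator and a lossy retriever are not mutually exclusive.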

Tools: RAGAS Component Evals & Custom pytest

RAGAS (RAG Assessment)

RAGAS is a Python framework specifically designed for decomposed RAG evaluation. It provides isolated metrics:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Evaluate with ground truth documents
results = evaluate(
    dataset,
    metrics=[
        context_precision,     # Layer 2
        context_recall,        # Layer 2
        faithfulness,          # Layer 3
        answer_relevancy,      # Layer 3
    ]
)

Key metrics RAGAS provides:

  • context_precision: Does retrieved context match ground truth?
  • context_recall: Did you retrieve all necessary documents?
  • faithfulness: Does the answer follow from the context?
  • answer_relevancy: Is the answer relevant to the question?
  • answer_correctness: Is the answer factually correct? (requires reference answers)

Custom pytest Protocol

For the isolation methodology above, write custom pytest fixtures:

import pytest
from your_rag_system import retriever, generator

@pytest.fixture
def test_set_with_ground_truth():
    """Load test set with human-annotated relevant documents."""
    return load_test_set("dev_set.json")

def test_retrieval_precision_recall(test_set_with_ground_truth):
    """Layer 2: Measure retrieval quality."""
    for question, relevant_docs in test_set_with_ground_truth:
        retrieved = retriever.retrieve(question, top_k=10)
        relevant_retrieved = [d for d in retrieved if d in relevant_docs]

        # Guard against an empty retrieval result
        precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
        recall = len(relevant_retrieved) / len(relevant_docs)

        assert precision > 0.7, f"Precision too low: {precision}"
        assert recall > 0.6, f"Recall too low: {recall}"

def test_generation_with_perfect_retrieval(test_set_with_ground_truth):
    """Layer 3: Measure generation quality with ideal retrieval."""
    for question, relevant_docs in test_set_with_ground_truth:
        # Use ground truth, not actual retrieval
        answer = generator.generate(question, context=relevant_docs)
        
        # Measure faithfulness: does answer follow from context?
        faithfulness_score = evaluate_faithfulness(answer, relevant_docs)
        assert faithfulness_score > 0.8
        
        # Measure relevance: does it answer the question?
        relevance_score = evaluate_relevance(answer, question)
        assert relevance_score > 0.75
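The `evaluate_faithfulness` and `evaluate_relevance` helpers above are left undefined; in practice they would wrap an LLM judge or a RAGAS metric. Purely to wire up the tests, a crude token-overlap placeholder (an assumption, not a real measurement) might look like:

```python
def _token_overlap(text, reference_texts):
    """Fraction of text's tokens that appear somewhere in the references.

    A crude stand-in for an LLM judge: fine for smoke-testing the
    pipeline, far too weak for real faithfulness measurement.
    """
    tokens = set(text.lower().split())
    if not tokens:
        return 0.0
    reference_tokens = set(" ".join(reference_texts).lower().split())
    return len(tokens & reference_tokens) / len(tokens)

def evaluate_faithfulness(answer, context_docs):
    # How much of the answer is grounded in the retrieved context?
    return _token_overlap(answer, context_docs)

def evaluate_relevance(answer, question):
    # How much of the question's content does the answer address?
    return _token_overlap(question, [answer])
```

Swapping these for a judge-based metric changes only the scoring function; the pytest structure stays identical.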

Real Debugging Case Study: DocAssist v2

A medical documentation RAG system was producing mediocre answers. The team didn't know whether to blame the retriever or generator. Here's how decomposition fixed it.

The Symptoms

  • Overall end-to-end faithfulness score: 62%
  • Users reported incorrect citations (the answer said source X, but source X didn't support it)
  • Some answers were incomplete

The Decomposition Analysis

Layer 1 (Ideal Recall): For 50 test questions, they manually identified all relevant documents. All 50 questions had at least one relevant document in the corpus. Layer 1 passed. ✓

Layer 2 (Actual Retrieval): They ran their retriever and measured:

  • Context precision: 68%
  • Context recall: 71%

Layer 2 showed moderate problems. 32% of retrieved documents were irrelevant (noise), and they missed 29% of relevant documents.

Layer 3 (Generation with Perfect Retrieval): They fed the ground truth documents to the generator:

  • Faithfulness: 89%
  • Answer relevancy: 87%

With perfect retrieval, the generator worked well. The problem was retrieval, not generation.

The Diagnosis

The decomposition revealed: the retriever was the bottleneck. Specific problems:

  • Synonymy problem: the query "diabetes type 2" wasn't retrieving documents that said "T2DM" (medical abbreviation)
  • Temporal problem: recent guideline updates weren't being prioritized
  • Noise problem: generic documents about "condition management" were flooding results

The Fix

Instead of retraining the generation model (which wouldn't help), they:

  1. Added query expansion: map medical abbreviations and synonyms before retrieval
  2. Boosted recency: weight documents published in last 2 years higher in ranking
  3. Added domain-specific reranking: a small model that learned to filter generic docs

The Results

After these retrieval improvements:

  • Context precision: 89% (up from 68%)
  • Context recall: 88% (up from 71%)
  • End-to-end faithfulness: 87% (up from 62%)

The generation quality stayed roughly the same (~87-89%), which makes sense—they fixed retrieval, not the generator. But the system's output quality improved because now the generator was receiving better context.

Key Insight
Without decomposition, this team might have spent weeks fine-tuning the generation model or A/B testing different prompts. Instead, they identified the root cause (retrieval) and fixed it directly. Decomposition saved engineering time and effort.

Conclusion: Decomposition as Discipline

Retrieval quality decomposition is not a one-time analysis—it's a diagnostic discipline that should become part of your evaluation culture. Every time your RAG system underperforms, ask:

  1. Is the information in my corpus? (Layer 1)
  2. Can my retriever find it? (Layer 2)
  3. Can my generator synthesize it correctly? (Layer 3)

These three questions will tell you exactly where to focus. Combined with the metrics in this guide (precision, recall, faithfulness, noise sensitivity), you'll know not just that your system is failing, but why.

That diagnostic precision is the difference between systematic improvement and random tweaking.