Introduction: Naming Failure Modes
When your RAG system produces a bad answer, it's tempting to blame it on "bad retrieval" or "bad generation." But that's too coarse. There are at least 12 distinct failure modes, each with different root causes, different detection methods, and different fixes.
Understanding the taxonomy allows you to diagnose precisely what went wrong. Precision in diagnosis leads to precision in fixes.
Retriever Failures (1-9) — Detailed Analysis
Failure 1: Missed Retrieval — Relevant Document Not Retrieved
Definition: The relevant document exists in your corpus but your retriever doesn't return it. The answer is knowable, but unretrievable.
Root Causes:
- Embedding Space Mismatch: The query embedding is far from the document embedding in vector space
- Query-Document Language Mismatch: Query uses different terminology than documents (synonyms, paraphrases)
- Semantic Distance Too Large: Query intent fundamentally different from how documents express the concept
- Ranking Threshold Too High: Document passes similarity threshold but ranked below cutoff
Real-World Example: A customer support RAG trained on technical documentation fails to retrieve answers about "activation codes" because the documentation calls them "serial number authentication tokens." The embeddings never align.
Detection Methods:
- For every failed answer, manually search your corpus for documents that would have fixed it
- If found, measure the similarity score between query and document
- Compare this "oracle retrieval score" to your actual retrieval threshold
- Analyze the embedding space distance using tools like UMAP to visualize misalignment
- Run query expansion experiments and measure oracle retrieval score improvement
Fixes to Try (in order of effectiveness):
- Query Expansion: Automatically generate synonyms, paraphrases, and related terms (use LLM to expand queries)
- Better Embeddings: Switch to domain-specific embedding model (e.g., for legal or medical)
- Fine-tune Embeddings: Finetune existing embedding model on your domain data with relevance pairs
- Synonym Dictionary: Build domain-specific synonym mapping (technical support: "activation code" → "serial token" → "authentication key")
- Multi-Retrieval Strategy: Retrieve with multiple similarity metrics (cosine, dot product, Euclidean) and ensemble results
- Hybrid Search: Combine dense (embedding) + sparse (BM25 keyword) retrieval to catch both semantic and lexical matches
Diagnostic Signals: Context recall is low. Precision might be reasonable. Average similarity score of top-K retrieved documents is lower than expected.
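One easy way to implement the hybrid-search fix is reciprocal rank fusion (RRF), which merges ranked lists from dense and sparse retrievers without having to normalize their incomparable scores. A minimal sketch; the document IDs are illustrative:

```python
# Reciprocal rank fusion (RRF): merge dense and sparse rankings without
# normalizing their incomparable scores. Document IDs here are illustrative.

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists (best-first) of doc IDs into one ranking.
    k=60 is the constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_serial_tokens", "doc_setup", "doc_billing"]          # embedding retrieval
sparse = ["doc_activation_faq", "doc_serial_tokens", "doc_setup"]  # BM25 retrieval
fused = rrf_fuse([dense, sparse])
# "doc_serial_tokens" ranks first because it appears high in both lists.
```

A document that ranks moderately well in both retrievers beats one that ranks highly in only one, which is exactly the behavior you want for the "activation code" vs. "serial token" vocabulary gap.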
Failure 2: Wrong Chunk — Right Document, Wrong Section
Definition: Your retriever finds a relevant document but returns the wrong section/chunk. The document contains the answer, but in a different part than what was retrieved.
Root Causes:
- Poor Chunking Strategy: Long documents split at arbitrary points (e.g., fixed token counts) that break meaning
- Loss of Context Between Chunks: Important context lives in previous chunk, current chunk references it ambiguously
- Embedding at Wrong Granularity: You embed at paragraph level but answer spans multiple paragraphs
- Insufficient Chunk Overlap: Moving window chunking without enough overlap loses connecting sentences
Real-World Example: A legal document retriever splits a contract at natural section breaks. Query: "What happens if I breach the warranty?" The warranty clause is in section 5. The retriever returns section 3 (general conditions) which mentions warranties but not breach consequences. The breach remedies are in section 7.
Detection Methods:
- For failed answers, manually find the document that contains the right answer
- Check if your retriever returned any chunk from that document
- If yes, compare the returned chunk to the ideal chunk
- Analyze: were they from the same document? If yes, this is wrong chunk failure
- Calculate "chunk recall within document": of the retrieved documents that contain the answer, what % returned the answer-bearing chunk?
Fixes to Try (in order of effectiveness):
- Overlapping Chunks: Use sliding window with 50-100 token overlap. This ensures context carries forward
- Semantic Chunking: Split documents at semantic boundaries (using sentence embeddings) rather than fixed sizes
- Hierarchical Chunking: Create chunks at multiple levels (section, paragraph, sentence) and retrieve at multiple granularities
- Context-Preserving Chunks: Append section headers and previous sentence to each chunk for context
- Finer-Grained Embeddings: Embed at sentence level instead of paragraph level for finer retrieval precision
Diagnostic Signals: You retrieve documents that are relevant (high precision on document-level) but chunks within those documents are wrong (low precision on chunk-level).
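The overlapping-chunks fix is only a few lines of code. A sketch over a pre-tokenized document; the chunk size and overlap values are illustrative and should be tuned for your corpus:

```python
def sliding_chunks(tokens, size=200, overlap=50):
    """Split a token list into chunks of `size` tokens, each sharing
    `overlap` tokens with the previous chunk so context carries forward."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

doc = [f"tok{i}" for i in range(500)]
chunks = sliding_chunks(doc)
# Three chunks; chunk 1 begins with the last 50 tokens of chunk 0,
# so a sentence cut at token 200 survives intact in the next chunk.
```
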
Failure 3: Stale Context — Outdated Information
Definition: The retrieved document is factually outdated. It was correct when written but facts have changed. The system confidently provides old answers.
Root Causes:
- Corpus Not Updated: Your indexed corpus is months or years old
- No Recency Bias: Ranking treats old and new documents equally
- Multi-Versioned Documents: You have both old and new versions in corpus; retriever picks old one
- Time-Sensitive Domain: Price lists, regulations, policy documents change frequently but corpus static
Real-World Example: A finance RAG system answers questions about tax policy. A new tax law passed in 2025, but the corpus still contains 2023 regulations. The system retrieves the old regulation and confidently provides advice that violates current law.
Detection Methods:
- Implement temporal tagging — every document has publication/update date
- For failed answers, check publication date of retrieved documents
- Correlate answer errors with publication date gaps
- Build a time-series eval set: same question answered on multiple dates, comparing to ground truth by date
- Track which domains have high failure rate for time-sensitive queries
Fixes to Try (in order of effectiveness):
- Update Corpus Regularly: Establish cadence for re-embedding and re-indexing (weekly for fast-moving domains)
- Recency Ranking: Weight retrieval by document freshness. Recent documents get score boost
- Version Management: Identify and remove old versions of documents automatically
- Time-Window Restriction: For time-sensitive queries, only retrieve documents published after cutoff date
- Live Data Integration: For highly time-sensitive data, skip retrieval and fetch from live systems (APIs, databases)
- Deprecation Notices: Mark old documents as deprecated; retriever deprioritizes them but keeps for historical queries
Diagnostic Signals: Errors correlate with time. Recent questions answered correctly; older questions answered with outdated facts. Retrieved documents consistently predate known fact changes.
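Recency ranking can be as simple as blending similarity with an exponential freshness decay. A sketch; the 80/20 blend and the 180-day half-life are assumptions to calibrate per domain:

```python
from datetime import date

def recency_score(similarity, published, today, half_life_days=180, weight=0.2):
    """Blend retrieval similarity with freshness. A document that is
    `half_life_days` old contributes half of the freshness weight."""
    age_days = (today - published).days
    freshness = 0.5 ** (age_days / half_life_days)
    return (1 - weight) * similarity + weight * freshness

today = date(2025, 6, 1)
new_doc = recency_score(0.80, date(2025, 5, 1), today)  # fresh, slightly less similar
old_doc = recency_score(0.82, date(2023, 6, 1), today)  # stale, slightly more similar
# new_doc > old_doc: the fresh document wins the near-tie.
```

Near-ties in similarity resolve toward fresher documents, while a strongly more similar document still wins regardless of age.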
Failure 4: Hallucinated Source — Wrong or Invented Citation
Definition: The generator cites a specific document that doesn't support the claim, or invents a source that doesn't exist in your corpus.
Root Causes:
- Pattern Matching From Training Data: LLM learned to cite like it saw in training, not actually reading retrieved context
- Overconfidence: Generator makes up plausible citations because it's confident in the answer
- Ambiguous Context: Retrieved context is loosely related; generator extrapolates and confabulates link
- Prompt Incompletely Enforces Citation: Your prompt says "cite sources" but doesn't verify or penalize hallucination
Real-World Example: Query: "What year was GPT-4 released?" Generator answers: "GPT-4 was released in March 2023 according to the OpenAI press release #47." But your corpus has no document with ID #47. The generator invented the citation.
Detection Methods:
- Citation Verification: For every answer with a citation, verify the cited document actually exists in your corpus
- Claim Matching: Extract claims from answer and search for supporting text in cited document
- Automated Citation Auditing: Use an LLM to evaluate: "Does this citation actually support the claim?" on sample
- Citation Drift Monitoring: Track % of answers with valid citations over time
- Impossible Citation Detection: Flag citations with format/ID that doesn't exist in corpus
Fixes to Try (in order of effectiveness):
- Retrieval-Guided Generation: Require generator to cite directly from retrieved context; use templates like "[Claim] (Document X, line Y)"
- Citation Verification Step: Add explicit verification step: before returning answer, verify each citation against corpus
- Constrained Generation: Force LLM to only cite from provided documents using constrained decoding
- Prompt Reinforcement: Add penalties in prompt: "If you cite a source, it must be from the provided documents. Do not hallucinate sources."
- Fine-tuning for Accuracy: Fine-tune LLM on examples where citations are verified correct
Diagnostic Signals: Citations don't match documents or contain impossible document IDs. High percentage of answers cite sources; low percentage of those citations are actually valid.
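The "impossible citation" check is cheap to automate. A sketch that assumes citations are rendered as "(Document N)"; adapt the regex to whatever citation format your prompt template enforces:

```python
import re

CITATION_RE = re.compile(r"\(Document\s+(\d+)\)")

def invalid_citations(answer, corpus_ids):
    """Return cited document IDs that do not exist in the corpus."""
    return [doc_id for doc_id in CITATION_RE.findall(answer)
            if doc_id not in corpus_ids]

answer = "GPT-4 was released in March 2023 (Document 47)."
bad = invalid_citations(answer, corpus_ids={"1", "2", "3"})
# bad == ["47"]: the generator invented the citation.
```

Run this on every answer before returning it; anything flagged goes to the verification step rather than to the user.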
Failure 5: Conflation — Incorrectly Mixing Multiple Sources
Definition: The retriever brings back two unrelated documents, and the generator incorrectly combines them into a false claim.
Root Causes:
- Entity Ambiguity: Same entity name in different documents with different properties (e.g., "Apple Inc." vs. "apple the fruit")
- Generator Scope Confusion: Generator doesn't track which claims come from which documents
- Retrieved Noise: Irrelevant documents retrieved alongside relevant ones; generator combines them
- Implicit Relationships: Generator assumes relationship between two documents that isn't explicitly stated
Real-World Example: Query: "What is the capital of Australia?" Retriever returns: (1) "Canberra is the capital of Australia" + (2) "Sydney is Australia's largest city." Generator conflates these: "The capital of Australia is Sydney Canberra" or "Sydney is the capital, which is Australia's largest city."
Detection Methods:
- For false claims in answers, trace the claim back to retrieved sources
- Check if the claim requires combining information from multiple documents
- Analyze: is the combination explicitly supported or implied/inferred?
- Compare properties across documents for same entity; flag contradictions
- Build a "conflation dataset" of false claims from real failures and audit them
Fixes to Try (in order of effectiveness):
- Entity Linking: Resolve entities to canonical IDs; system knows when "Apple Inc." != "apple"
- Source Attribution in Prompts: Explicitly require LLM to tag each claim with source document: "[Fact1 - Doc A] [Fact2 - Doc B]"
- Scope Checking: Add prompt: "Do not combine facts from different documents unless explicitly related"
- Retriever Denoising: Use reranker to remove noisy irrelevant documents before generation
- Fact Verification: Post-generation verification: for multi-source claims, verify relationship actually exists
Diagnostic Signals: False claims that are "locally true" — each component comes from a document, but the combination is false. Answers cite multiple documents for a single claim.
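If you adopt the source-attribution prompt above, flagging candidate conflations becomes a string check: any single sentence that cites more than one source deserves review. A sketch; the "[Doc X]" tag format is an assumption matching the prompt example:

```python
import re

def multi_source_sentences(answer):
    """Return sentences that cite more than one source tag:
    candidates for conflation review, not automatic rejection."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        sources = set(re.findall(r"\[Doc\s+([A-Za-z0-9]+)\]", sentence))
        if len(sources) > 1:
            flagged.append(sentence)
    return flagged

answer = ("Canberra is the capital of Australia [Doc A]. "
          "Sydney, the capital, is the largest city [Doc A] [Doc B].")
# Only the second sentence is flagged: it merges facts from two documents.
```
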
Failure 6: Over-Retrieval Noise
Definition: Your retriever returns many documents, but most are noise. The relevant information is buried under irrelevant context.
Root Causes:
- Loose Threshold: You retrieve top-K documents where K is large or similarity threshold is too low
- Broad Query Matching: Query matches many documents tangentially but none deeply
- Corpus Quality Issues: Your corpus contains duplicate or near-duplicate documents that all get retrieved
- Retrieval Strategy Too Permissive: You use OR-based logic when AND-based would be tighter
Real-World Example: Query: "How do I reset my password?" Retriever returns 50 support articles about "account management" which mention passwords in passing. Only 5 of the 50 directly address password reset. The generator is overwhelmed by noise.
Detection Methods:
- Calculate Context Precision: Of top-K retrieved documents, what % are relevant to the query?
- Measure Average Rank Position: At what position do retrieved relevant documents appear? (Lower is better)
- Audit false negatives: for wrong answers, check if relevant doc was retrieved but ranked low
- Sample random answers and manually rate retrieved context for relevance
- Compare generator output with/without top-K documents to measure noise impact
Fixes to Try (in order of effectiveness):
- Tighter Retrieval: Reduce K or raise similarity threshold; retrieve fewer documents
- Better Ranking: Add a reranker (cross-encoder) that re-ranks retrieved documents
- Query Refinement: Use LLM to refine query before retrieval (more specific → tighter results)
- Multi-Stage Retrieval: Retrieve large pool, coarse-rank, then fine-rank for top-K
- Deduplication: Remove near-duplicate documents from retrieval results
Diagnostic Signals: Context precision is low. Relevant documents are retrieved but mixed with many irrelevant ones. Generator struggles to extract signal from noise.
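Context precision, the headline metric for this failure, is a one-liner worth tracking per query. A sketch using gold relevance labels from an eval set (the document IDs are illustrative):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Precision@K: fraction of retrieved documents that are relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for d in retrieved_ids if d in relevant) / len(retrieved_ids)

# 50 articles retrieved, only 5 actually about password reset:
retrieved = [f"doc{i}" for i in range(50)]
relevant = ["doc0", "doc3", "doc7", "doc12", "doc30"]
precision = context_precision(retrieved, relevant)  # 0.1, a noisy retrieval
```
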
Failure 7: Under-Retrieval Gap
Definition: You retrieve too few documents, missing relevant information that would have completed the answer.
Root Causes:
- Conservative Retrieval: You retrieve top-K where K is small or similarity threshold is very high
- Insufficient Coverage: Single document doesn't contain full answer; you need multiple documents
- Information Scattered: Answer requires combining information from documents; you retrieve too few to synthesize
- Edge Case Sparsity: Query addresses rare scenario; fewer documents cover it
Real-World Example: Query: "What are the side effects, contraindications, and interactions of drug X?" You retrieve only the main drug information document but miss the separate pharmacovigilance report and the drug-interaction database entries. Answer is incomplete.
Detection Methods:
- Calculate Context Recall: Of all documents that contain relevant information, what % did you retrieve?
- Measure Answer Completeness: Of key facts needed to answer fully, how many are in retrieved context?
- For incomplete answers, manually search corpus: do additional documents exist that would fill gaps?
- Compare answer quality with retrieval count: does quality improve with more retrieved documents?
- Analyze query complexity: multi-hop, multi-document queries show lower completion rates
Fixes to Try (in order of effectiveness):
- Increase Retrieval Count: Retrieve more documents (increase K)
- Lower Threshold: Lower similarity threshold to include marginal documents
- Multi-Hop Retrieval: Retrieve initial results; use LLM to identify missing information; retrieve again
- Iterative Refinement: Ask generator "what additional information would help?" and retrieve based on answer
- Query Decomposition: Break complex query into sub-queries; retrieve for each independently
Diagnostic Signals: Context recall is low. Answers are incomplete or vague. Generator often says "information not provided in context."
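Query decomposition, the last fix above, can be sketched as a loop over sub-queries with deduplicated results. Here `decompose` and `retrieve` are stand-ins for an LLM call and your retriever; the stubs below illustrate the drug-information example:

```python
def decompose_and_retrieve(query, decompose, retrieve, k_per_subquery=3):
    """Retrieve for each sub-query independently, deduplicating by doc ID."""
    seen, merged = set(), []
    for sub_query in decompose(query):
        for doc_id in retrieve(sub_query)[:k_per_subquery]:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Stubs standing in for an LLM decomposer and a retriever:
subs = {"drug X": ["side effects of drug X", "contraindications of drug X",
                   "interactions of drug X"]}
index = {"side effects of drug X": ["doc_main"],
         "contraindications of drug X": ["doc_main", "doc_pharmacovigilance"],
         "interactions of drug X": ["doc_interactions"]}
docs = decompose_and_retrieve("drug X", subs.get, index.get)
# docs == ["doc_main", "doc_pharmacovigilance", "doc_interactions"]
```

Each facet of the question gets its own retrieval budget, so the pharmacovigilance report is no longer crowded out by the main drug document.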
Failure 8: Embedding Drift
Definition: Your embedding model has degraded or the training distribution has drifted. Queries and documents no longer align well in embedding space.
Root Causes:
- Domain Evolution: Your domain vocabulary evolved but embeddings trained on static data
- Corpus Expansion: New documents added with different vocabulary/style than originals
- Query Language Shift: Users asking questions in new ways not seen in training
- Catastrophic Embedding Forgetting: If you retrained or fine-tuned the embedding model without re-indexing, new query vectors misalign with the old corpus vectors
Real-World Example: Medical RAG system trained embeddings 2 years ago on academic papers. During COVID-19, users asked about vaccine efficacy, long COVID, variants — terminology the embeddings never saw. Retrieval quality degraded 40% in 6 months.
Detection Methods:
- Embedding Drift Monitoring: Track average retrieval precision/recall over time
- Embedding Quality Metrics: Compare new embeddings to old on held-out queries
- Vocabulary Coverage: Analyze new queries/documents; % of tokens not in embedding training data
- Vector Space Analysis: Use UMAP to visualize embedding space; check for clustering degradation
- Comparison Testing: Baseline old embedding model vs. new on same eval set
Fixes to Try (in order of effectiveness):
- Retrain Embeddings: Periodically retrain on current corpus + current query distribution
- Fine-Tune Embeddings: Fine-tune existing embedding model on your domain data with human relevance labels
- Upgrade Embedding Model: Switch to newer embedding model (e.g., from sentence-transformers) trained on more recent data
- Continuous Retraining: Auto-retrain embeddings monthly with new data
- Domain-Specific Embeddings: Use embeddings trained specifically for your domain (legal, medical, finance-specific models)
Diagnostic Signals: Performance degradation over time. Queries that worked 3 months ago now fail. New domains/vocabularies have lower retrieval quality than established ones.
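Drift monitoring reduces to comparing a recent window of retrieval metrics against a baseline window. A minimal sketch over weekly recall numbers; the window sizes and the 5-point drop threshold are assumptions to tune:

```python
from statistics import mean

def drift_alert(weekly_recall, baseline_weeks=8, recent_weeks=4, max_drop=0.05):
    """True when average recall over the recent window has dropped more than
    `max_drop` below the baseline window, a signal to re-evaluate embeddings."""
    if len(weekly_recall) < baseline_weeks + recent_weeks:
        return False  # not enough history yet
    baseline = mean(weekly_recall[:baseline_weeks])
    recent = mean(weekly_recall[-recent_weeks:])
    return (baseline - recent) > max_drop

history = [0.82, 0.81, 0.83, 0.80, 0.82, 0.81, 0.80, 0.82,  # baseline period
           0.74, 0.72, 0.71, 0.70]                          # recent degradation
# drift_alert(history) is True: recall fell roughly 10 points.
```
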
Failure 9: Query Reformulation Failure
Definition: Your system reformulates the user query to improve retrieval, but the reformulation makes things worse.
Root Causes:
- Reformulation Model Confusion: LLM used to reformulate misunderstands original intent
- Over-Generalization: Reformulation broadens query too much (too permissive)
- Over-Specialization: Reformulation narrows query too much (too restrictive)
- Paraphrase Failure: Attempted paraphrase introduces new query intent not in original
Real-World Example: User query: "What's the difference between machine learning and deep learning?" Reformulation: "Define machine learning. Define deep learning." Now you have two separate queries, losing the comparative intent. Results discuss each separately but don't compare.
Detection Methods:
- Log both original and reformulated queries for every request
- For failed answers, compare: did original query work better than reformulated?
- Manually audit reformulations for semantic equivalence to original
- Measure retrieval quality: original query embeddings vs. reformulated
- Test: does retrieval improve with reformulation (on an oracle eval set)?
Fixes to Try (in order of effectiveness):
- Better Reformulation Model: Use more capable LLM for reformulation (e.g., GPT-4 instead of GPT-3.5)
- Reformulation Validation: Verify reformulated query retrieves better than original before using it
- Query Expansion Over Reformulation: Instead of rewriting query, expand it (keep original + add synonyms)
- Ensemble Multiple Reformulations: Generate 3-5 reformulations; retrieve with all; ensemble results
- Disable Reformulation: Not all queries benefit; use original query for specific query types
Diagnostic Signals: Reformulated queries have lower retrieval quality than original. Reformulations change fundamental intent of question. Queries about comparisons get split into separate queries.
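Reformulation validation, the second fix above, is a guard that adopts the rewrite only if it retrieves at least as well for the original intent. `retrieve` and `relevance` are stand-ins for your retriever and a lightweight scorer (a reranker would be better than the word-overlap stub used here):

```python
def choose_query(original, reformulated, retrieve, relevance):
    """Score both retrieval sets against the ORIGINAL query's intent and
    keep the reformulation only if it does not lose ground."""
    original_docs = retrieve(original)
    reformulated_docs = retrieve(reformulated)
    if relevance(original, reformulated_docs) >= relevance(original, original_docs):
        return reformulated
    return original

# Stub scorer: count original-query words appearing in retrieved snippets.
def relevance(query, docs):
    words = set(query.lower().split())
    return sum(len(words & set(d.lower().split())) for d in docs)

retrieve = {
    "what's the difference between ML and DL?": ["ML vs DL comparison guide"],
    "define ML": ["glossary of AI terms"],
}.get
original = "what's the difference between ML and DL?"
chosen = choose_query(original, "define ML", retrieve, relevance)
# chosen == original: the rewrite lost the comparative intent.
```

The key design choice is scoring both result sets against the original query, not the reformulated one, so a rewrite that drifts in intent cannot validate itself.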
Generator Failures (10-12)
Failure 10: Reranker Errors
Definition: Your reranker (if you have one) ranks relevant documents low, causing them to be dropped before generation.
Root Causes:
- Relevance Mismatch: Reranker trained on different definition of relevance than you need
- Domain Shift: Reranker trained on general web data; you're using it on specialized domain
- Insufficient Training Data: Reranker underfits; no labeled training data from your domain
- Threshold Miscalibration: You set cutoff score too high; relevant documents dropped
Real-World Example: You use a generic reranker trained on MS-MARCO data. For medical queries, it downranks documents with "discuss with your doctor" (appropriate caveating) while upranking confident assertions. Its notion of relevance is unsafe for medical use.
Detection Methods:
- Compare retriever ranking vs. reranker ranking on eval set
- For failed answers: were relevant documents ranked higher by retriever than reranker?
- Measure reranker accuracy on domain-specific test set
- Analyze: reranker score distribution; check for bias toward/against certain document types
Fixes to Try (in order of effectiveness):
- Domain-Specific Reranker: Fine-tune reranker on your domain (requires labeled relevance data)
- Retrain Reranker: If your relevance definition changed, retrain from scratch
- Lower Threshold: Increase number of documents passed to generator
- No Reranking: For some domains, better to skip reranking and increase retrieval count
- Multi-Stage Ranking: Use multiple rerankers with different strengths; ensemble their scores
Diagnostic Signals: Retriever quality is good; generator quality worse than expected. Relevant documents retrieved but ranked low by reranker.
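The first detection method, comparing the retriever's ranking to the reranker's, can be automated per eval query. A sketch that flags cases where the reranker demoted a known-relevant document; the IDs are illustrative:

```python
def rank_of(doc_id, ranking):
    """1-based rank of doc_id in a best-first list, or None if absent."""
    return ranking.index(doc_id) + 1 if doc_id in ranking else None

def reranker_demoted(relevant_id, retriever_ranking, reranker_ranking):
    """True when the reranker pushed a known-relevant document down,
    which is the core diagnostic signal for this failure mode."""
    before = rank_of(relevant_id, retriever_ranking)
    after = rank_of(relevant_id, reranker_ranking)
    return before is not None and (after is None or after > before)

retriever = ["doc_dosage", "doc_faq", "doc_blog"]
reranker = ["doc_blog", "doc_faq", "doc_dosage"]  # relevant doc demoted to last
# reranker_demoted("doc_dosage", retriever, reranker) is True
```

Run this over your labeled eval set; a high demotion rate says the reranker, not the retriever, is the component to fix.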
Failure 11: Context Window Overflow
Definition: Your retrieved context is so large that relevant information gets cut off or pushed out of the context window, becoming unavailable to generator.
Root Causes:
- Too Many Documents Retrieved: You pass 50 documents to LLM with 4K context window
- Small Context Window: Using model with small context limit relative to your corpus size
- Long Document Format: Your chunks are very large; a few chunks fill context window
- Inefficient Context Format: You include metadata, headers, formatting that takes tokens
Real-World Example: You retrieve 20 documents (average 500 tokens each) = 10K tokens. Your LLM has a 4K context window, so after the prompt only the first few documents fit. The generator never sees the rest, even if they're more relevant.
Detection Methods:
- Monitor context token count vs. context window size
- Track what % of retrieved context actually used by generator
- For failed answers: were relevant documents in part of context that got truncated?
- Analyze token distribution: first-K documents vs. remaining documents in retrieved context
Fixes to Try (in order of effectiveness):
- Reduce Retrieval Count: Retrieve fewer documents; prioritize quality over quantity
- Upgrade Model Context: Use LLM with larger context window (e.g., Claude 200K vs. 4K)
- Compress Context: Compress retrieved documents before passing to generator (abstractive or extractive summarization)
- Better Ranking: Ensure top-ranked documents are truly most relevant; eliminate less important ones
- Hierarchical Context: Pass full documents for top-3 results; abstracts/summaries for documents 4-10
Diagnostic Signals: Context token count consistently near or exceeds context window. Generator output quality improves when fewer documents provided. Last retrieved documents never contribute to output.
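Rather than letting the model silently truncate, pack documents into an explicit token budget. A greedy sketch, assuming best-first input order and a `budget_tokens` that already leaves room for the prompt and the generation:

```python
def pack_context(ranked_docs, budget_tokens):
    """Greedily keep best-first documents that fit the token budget.
    ranked_docs: list of (doc_id, token_count) pairs, best first."""
    packed, used = [], 0
    for doc_id, n_tokens in ranked_docs:
        if used + n_tokens > budget_tokens:
            continue  # doesn't fit; a smaller later doc may still fit
        packed.append(doc_id)
        used += n_tokens
    return packed

docs = [("doc1", 500), ("doc2", 400), ("doc3", 300), ("doc4", 150)]
kept = pack_context(docs, budget_tokens=1000)
# kept == ["doc1", "doc2"]: with 900 tokens used, neither doc3 nor doc4 fits.
```

Skipping (rather than stopping at) the first oversized document lets shorter lower-ranked documents use the remaining budget.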
Failure 12: Citation Fabrication
Definition: The generator produces an answer that's correct, but cites a source that doesn't actually support the claim, or invents a source that doesn't exist in your corpus.
Root Causes:
- Pattern Matching From Training: LLM learned to cite from pre-training data, not from provided context
- Confidence Overrides Accuracy: Generator is confident enough to cite, even though it's making up the citation
- Lost Context: Generator forgets what was actually provided vs. what it inferred
- Subtle Citation Drift: Citation approximately correct but details wrong (minor hallucination)
Real-World Example: Query: "How many acres is Central Park?" Correct answer: 843 acres. Retrieved document says 843. Generator says: "843 acres, per NYC Parks Department Report 2023." But your corpus has no report with that title — generator invented it.
Detection Methods:
- For every citation, verify document actually exists in your corpus
- Verify the cited text actually supports the claim (automated or human)
- Measure citation accuracy rate over sample of answers
- Flag impossible citations (wrong format, non-existent IDs)
- Check: is the citation necessary for the claim, or just cosmetic?
Fixes to Try (in order of effectiveness):
- Retrieval-Guided Generation: Force generator to cite only from provided documents using special tokens
- Post-Generation Verification: After generating answer, verify each citation before returning
- Constrained Decoding: Use vLLM or similar to constrain citations to valid document IDs
- Fine-Tuning for Accuracy: Fine-tune LLM on examples with verified correct citations
- No-Citation Option: Allow "I don't know" answers without citation rather than hallucinated citations
Diagnostic Signals: Citations exist but don't actually support claims. Citation IDs/names don't match documents in corpus. Answer is correct but citations are incorrect/made-up.
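A crude lexical support check catches the "correct answer, fabricated citation" case before an LLM judge is needed. The stop-word list is a placeholder assumption; treat low scores as flags for human review, not automatic rejection:

```python
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "per", "and"}

def support_ratio(claim, cited_text):
    """Fraction of the claim's content words found in the cited passage."""
    words = [w.strip(".,;:").lower() for w in claim.split()]
    content = [w for w in words if w and w not in STOP_WORDS]
    if not content:
        return 1.0
    passage = cited_text.lower()
    return sum(1 for w in content if w in passage) / len(content)

passage = "Central Park covers 843 acres in Manhattan."
ok = support_ratio("Central Park is 843 acres", passage)              # 1.0
bad = support_ratio("per NYC Parks Department Report 2023", passage)  # 0.0
```

The fabricated citation scores zero because none of its distinctive words appear in the retrieved passage, even though the answer itself is correct.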
Compound Failures — When Multiple Failures Cascade
Why Compound Failures Matter: Most real-world RAG failures aren't single-mode. They're combinations. And the impact isn't additive — it's multiplicative.
Failure Combination Patterns:
Pattern A: Missed Retrieval + Hallucination
Relevant document isn't retrieved (Failure 1). Generator, lacking context, hallucinates an answer with a fabricated citation (Failure 12). Result: a wrong answer backed by a confident citation, and the user has no signal that anything is wrong.
Example: User: "What's the policy on employee remote work?" Retrieval misses the remote work policy document. Generator: "You can work remotely 2 days/week per the Employee Handbook 2025" (fabricated — no such policy exists). User believes this and works remotely; gets reprimanded.
Pattern B: Over-Retrieval Noise + Conflation
Retriever returns 30 documents, most irrelevant (Failure 6). Generator mixes unrelated documents into a false claim (Failure 5). Result: Confident, incorrect synthesis.
Example: Query: "What are the risks of AI adoption?" Retrieval returns: AI papers, risk management papers, adoption case studies. Generator conflates: "AI adoption has the same risks as traditional technology adoption (low)" + "AI has existential risks (high)" into: "AI adoption has existential risks."
Pattern C: Wrong Chunk + Stale Context
Retriever finds document but wrong chunk (Failure 2). That chunk is from 2022 (Failure 3). Result: Confidently wrong outdated information from the right document.
Example: Query: "What's the current price of API access?" Document about pricing exists (2022 version retrieved). Retrieved chunk: "API costs $10/1M tokens." Today's price is $5. User bases pricing model on outdated data.
Pattern D: Under-Retrieval Gap + Incomplete Generation
Too few documents retrieved (Failure 7). Generator can't synthesize complete answer. Result: Incomplete but confident answer that sounds complete.
Example: Medical query requires information from drug safety database + patient population study + clinical trial. You only retrieve the safety database. Answer is incomplete about efficacy in specific populations but doesn't signal that.
Detecting Compound Failures:
- For each failed answer, don't stop at finding one failure mode
- Ask: Could multiple issues have contributed?
- Trace answer backwards: Was information missing (retrieval issue)? Was it misinterpreted (generation issue)? Both?
- Build failure pattern matrix: track co-occurrence of failure types
- Test fixes: if you fix Failure 1 and answer still fails, look for Failure 2
Fixing Compound Failures:
Fix the highest-impact component first. Usually: (1) Retrieval issues first (can't generate good answer from missing context), (2) Generation issues second (assuming good retrieval).
Silent vs. Loud Failures — The Confidence Problem
Silent Failures: System produces a wrong answer with high confidence. User doesn't know it's wrong. This is the most dangerous failure type.
Loud Failures: System produces an error message or explicitly says "I don't know." User knows they can't rely on the answer.
The Asymmetry: A loud failure is better than a silent failure. Users can work around loud failures. Silent failures erode trust invisibly.
Examples:
- Silent Failure: Query: "What's the dosage for drug X?" System confidently provides wrong dosage. Patient takes overdose.
- Loud Failure: Query: "What's the dosage for drug X?" System: "I don't have reliable information about dosages. Please consult your pharmacist."
Silent Failure Causes:
- Hallucinated answers that sound plausible
- Correct answers to wrong interpretations of the question
- Outdated information presented as current
- Wrong chunk that's plausible but incorrect
Detecting Silent Failures:
- Use LLM-as-judge to evaluate answer quality, not just accuracy
- Look for confidence signals: hedging language, caveats, and "I'm not sure" indicate a loud failure (good); absolute statements indicate silent-failure risk
- Measure: what % of wrong answers were presented with high confidence?
- For high-stakes domains (medical, legal, financial), sample answers and manually verify
Converting Silent to Loud Failures:
- Confidence Scoring: Add mechanism that estimates confidence in answer; return "I'm not confident" for low-confidence answers
- Verification Step: Before returning answer, verify it against retrieved context; flag low-confidence answers
- Fallback Behavior: For low-confidence answers, provide "I'm not sure, but here's what I found" instead of confident wrong answer
- Uncertainty Quantification: Use Bayesian methods or ensemble diversity to estimate answer uncertainty
- Explicit Disclaimers: For high-risk domains, always include: "This is information only, not professional advice"
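The fallback behaviors above compose into a small gating function. The two thresholds are assumptions to calibrate against whatever confidence estimator you use:

```python
def answer_or_abstain(answer, confidence, soft=0.7, hard=0.4):
    """Convert silent failures into loud ones with two thresholds:
    hedge below `soft`, abstain entirely below `hard`."""
    if confidence >= soft:
        return answer
    if confidence >= hard:
        return "I'm not sure, but here's what I found: " + answer
    return ("I don't have reliable information to answer this. "
            "Please consult an authoritative source.")

# answer_or_abstain("843 acres", 0.9) returns the bare answer;
# at 0.5 it returns the hedged version; at 0.2 it abstains entirely.
```
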
Failure Diagnosis Protocol — Step-by-Step
When a RAG system produces a wrong answer, use this protocol to diagnose root cause:
Step 1: Verify the Failure (2 min)
- Is the answer actually wrong? (Not: is it incomplete or could be better?)
- Confirm against ground truth / oracle / domain expert
- Document: the query, the answer, why it's wrong
Step 2: Analyze Retrieval (5 min)
- What documents were retrieved?
- Are they relevant to the query? (Document-level evaluation)
- Do they contain information that would have produced a correct answer?
- If relevant doc not in top-10: Failure 1 (Missed Retrieval)
- If relevant doc in results but not in top-K passed to generator: Failures 6-7 (Over/Under-retrieval) or 10 (Reranker)
- Check document publication dates: old documents retrieved? Failure 3 (Stale Context)
Step 3: Analyze Context Quality (5 min)
- Is the relevant information in the retrieved context presented clearly?
- Is the right chunk included or just the document?
- Is context missing crucial information? Failure 2 (Wrong Chunk) or 7 (Under-Retrieval)
- Is context unclear or ambiguous? Might cause conflation (Failure 5)
- Is the context window too small relative to retrieved documents? Failure 11 (Context Window Overflow)
Step 4: Analyze Generation (5 min)
- If retrieval is good, why did generation fail?
- Does the answer match any retrieved information? Or is it hallucinated? Failures 4, 12 (Hallucination, Citation Fabrication)
- Does the answer incorrectly combine information from multiple documents? Failure 5 (Conflation)
- Are citations valid? Failure 4, 12 (Hallucinated/Fabricated Sources)
- Does the answer ignore retrieved context and rely on pre-training? Failure 12 (Citation Fabrication)
Step 5: Root Cause Classification (3 min)
- Based on Steps 2-4, identify which failure mode(s)
- Prioritize: is it primarily a retrieval issue or generation issue?
- Document the root cause with evidence
Step 6: Design Fix (10 min)
- Based on failure mode, what fixes to try (in priority order)?
- Identify what metrics would validate the fix works
- Plan experiment: would the fix have prevented this failure?
Time Budget: 30 minutes per failure diagnosis
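The six steps above can be folded into a first-pass triage function over the boolean signals collected in Steps 2-4. This returns only the first matching mode in pipeline order; compound failures still need the full protocol:

```python
def triage_failure(signals):
    """Map diagnosis signals to a first-pass failure label.
    `signals` is a dict of booleans gathered during Steps 2-4;
    missing keys are treated as False."""
    if not signals.get("relevant_doc_retrieved"):
        return "Failure 1: Missed Retrieval"
    if not signals.get("relevant_doc_in_top_k"):
        return "Failures 6/7/10: Over-/Under-Retrieval or Reranker"
    if signals.get("retrieved_doc_outdated"):
        return "Failure 3: Stale Context"
    if not signals.get("right_chunk_returned"):
        return "Failure 2: Wrong Chunk"
    if signals.get("context_truncated"):
        return "Failure 11: Context Window Overflow"
    if not signals.get("citations_valid"):
        return "Failures 4/12: Hallucinated or Fabricated Citation"
    if signals.get("claims_mix_documents"):
        return "Failure 5: Conflation"
    return "Unclassified: escalate to manual review"

label = triage_failure({"relevant_doc_retrieved": True,
                        "relevant_doc_in_top_k": True,
                        "retrieved_doc_outdated": True})
# label == "Failure 3: Stale Context"
```

Checking retrieval before generation mirrors the protocol's ordering: a generation fix cannot compensate for missing context.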
RAG Failure Rates by Industry
Research from deployed RAG systems shows typical failure rate distributions. Your rates may differ, but these provide a baseline for what's "normal."
| Industry | Typical Failure Rate | Most Common Mode | Second Most Common | Recommendations |
|---|---|---|---|---|
| Healthcare/Medical | 8-12% | Hallucination (45%) | Missed Retrieval (28%) | Strict verification; domain embeddings; citation requirements |
| Legal | 6-10% | Stale Context (38%) | Wrong Chunk (24%) | Frequent corpus updates; semantic chunking; recency bias |
| Finance/Banking | 5-8% | Conflation (35%) | Stale Context (30%) | Entity linking; daily updates; source separation |
| Customer Support | 12-18% | Missed Retrieval (40%) | Over-Retrieval Noise (25%) | Query expansion; user feedback loop; frequent reranking training |
| Technical Documentation | 4-7% | Stale Context (42%) | Wrong Chunk (22%) | Versioning; timestamp tracking; update notifications |
| E-Commerce Product Info | 10-15% | Stale Context (38%) | Missed Retrieval (32%) | Real-time catalog sync; inventory updates; query expansion |
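To use the table as a baseline check, compare your measured failure rate against the typical range for your industry. A small sketch (the numbers mirror the table above; keys are illustrative):

```python
# Typical failure-rate ranges from the table above, as (low, high) fractions.
INDUSTRY_BASELINES = {
    "healthcare": (0.08, 0.12),
    "legal": (0.06, 0.10),
    "finance": (0.05, 0.08),
    "customer_support": (0.12, 0.18),
    "tech_docs": (0.04, 0.07),
    "ecommerce": (0.10, 0.15),
}

def compare_to_baseline(industry: str, observed_rate: float) -> str:
    """Classify an observed failure rate relative to the industry's typical range."""
    low, high = INDUSTRY_BASELINES[industry]
    if observed_rate < low:
        return "below typical range"
    if observed_rate > high:
        return "above typical range -- investigate"
    return "within typical range"

print(compare_to_baseline("finance", 0.11))  # above typical range -- investigate
```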
20-Item Pre-Deployment Quality Gate
Before deploying your RAG system to production, verify these 20 items. Each should have a passing test or documented exception.
Retrieval Quality (7 items):
1. Retrieval precision ≥ 0.85 on test set (documents are relevant)
2. Retrieval recall ≥ 0.80 on test set (relevant documents are found)
3. Embedding model tested on domain data; not just off-the-shelf
4. Query expansion tested; improves recall without hurting precision
5. Chunking strategy validated; chunks are cohesive and don't break context
6. Stale document detection implemented or corpus freshness verified
7. Retrieval latency acceptable for use case (goal depends on application)
Generation Quality (6 items):
8. Answer accuracy ≥ 95% on high-risk queries (medical, legal, financial)
9. Citation accuracy ≥ 95% (if citations provided)
10. No hallucinated sources detected in sample of 100 answers
11. Conflation test passed: system doesn't incorrectly mix documents
12. Confidence signals appropriate: wrong answers flagged as uncertain
13. Length of answers reasonable (not too long/short for context window)
Robustness (4 items):
14. Edge cases tested: ambiguous queries, out-of-domain queries, adversarial inputs
15. Multi-language support (if applicable): tested for your languages
16. Empty/error handling: system degrades gracefully when retrieval fails
17. Performance under load tested: latency acceptable at 10x expected QPS
Monitoring (3 items):
18. Logging implemented: every query logged with retrieval results and generation output
19. Metrics dashboard created: track precision, recall, answer accuracy, latency
20. Incident response plan written: what to do if failure rate spikes
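The measurable items in this checklist can be enforced as an automated gate before deployment. A minimal sketch using the thresholds above (metric names are assumptions, not a standard API):

```python
# Minimum thresholds taken from the checklist items above.
GATE_THRESHOLDS = {
    "retrieval_precision": 0.85,         # item 1
    "retrieval_recall": 0.80,            # item 2
    "answer_accuracy_high_risk": 0.95,   # item 8
    "citation_accuracy": 0.95,           # item 9
}

def quality_gate(metrics: dict) -> list:
    """Return the list of failing checks; an empty list means the gate passes."""
    failures = []
    for name, minimum in GATE_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures

failing = quality_gate({
    "retrieval_precision": 0.88,
    "retrieval_recall": 0.76,   # below the 0.80 gate
    "answer_accuracy_high_risk": 0.96,
    "citation_accuracy": 0.97,
})
```

Items that are not single numbers (chunking validation, incident response plan) still need a passing test or documented exception, as stated above; the gate only covers the threshold-style checks.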
Case Study: Diagnosing a Real Production RAG Failure
The Incident:
A financial services company deployed a RAG system to help employees answer customer questions about investment products. After 2 weeks in production, they noticed:
- 11% of generated answers were factually wrong
- When wrong, answers were presented confidently (silent failures)
- Wrong answers mostly about product fees and minimum investment amounts
- Customer complaints about being given incorrect information
Initial Hypothesis: "The generation model is hallucinating. We need a better LLM."
Diagnosis Process (30-minute investigation):
Step 1: Verify Failure (done — documented 50 wrong answers)
Step 2: Analyze Retrieval
Sample 10 failed answers. For each, check what was retrieved:
- Failure case 1: Query "What's the minimum investment for Fund X?" Retrieved documents: Fund X overview (correct), competitor comparison, general investing guide. Fund X minimum was in the overview.
- Failure case 2: Query "What fees apply to Account Type Y?" Retrieved: Account Type Y overview, fee schedule (from 2023), new fee schedule (from 2024). Both retrieved.
- Failure case 3-10: Similar — relevant documents retrieved.
Finding: Retrieval is working. Relevant documents are in the context. This isn't Failure 1 (Missed Retrieval).
Step 3: Analyze Context Quality
For Failure case 2: Generated answer said "Account fees are 0.5% annually per the 2023 fee schedule." But 2024 schedule shows 0.35%. Both documents retrieved, but generator used old one. The retrieved documents say different things.
Finding: Failure 3 (Stale Context). The corpus contains both old and new versions of documents. Retriever returned both. Generator picked the wrong one.
Step 4: Analyze Generation
Looking at generation prompts: System was told "Use provided documents to answer." For Failure case 2, both fee documents provided. Generator should have said "As of 2024, fees are 0.35%" but instead cited 2023.
Check: Did the generator misread the documents, or simply pick the wrong one?
Manual inspection of 10 cases:
- 5 cases: Generator used older date (2023-2024 discrepancy)
- 3 cases: Generator misread minimum amounts (e.g., "$10,000 minimum" read as "$1,000")
- 2 cases: Generator conflated similar products (two Fund X variants mixed)
Step 5: Root Cause Classification
Primary causes (in order):
- Failure 3 (Stale Context) — 50% of failures: Multiple versions of documents in corpus; generator picks old version
- Failure 5 (Conflation) — 20% of failures: Similar products retrieved together; generator confuses them
- Failure 2 (Wrong Chunk) — 20% of failures: Correct document retrieved but wrong section (e.g., minimum for account type A vs. B)
- Failure 12 (Citation Fabrication) — 10% of failures: Generator cited non-existent fee schedule
Step 6: Design Fixes
In priority order:
- Fix 1 (Stale Context): Remove 2023 fee schedules from corpus; keep only current + archive. Mark archive versions as "historical only." Implement date-based retrieval filtering.
- Fix 2 (Conflation): Add entity linking for product IDs. When query mentions "Fund X," tag it with canonical product ID. Ensure generator knows which product is being discussed.
- Fix 3 (Wrong Chunk): Switch from paragraph-level chunking to semantic chunking. Each "Account Type Y minimum: $X" fact gets its own chunk.
- Fix 4 (Citation Accuracy): Add verification step: after generating answer, verify each number/fact against retrieved documents before returning.
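Fix 1's date-based filtering can be sketched as a post-retrieval step that keeps only the newest version of each document family. A minimal sketch under assumed metadata (the `family` and `version` fields are illustrative):

```python
def filter_stale_versions(docs: list) -> list:
    """Keep only the most recent version per document family.

    Each doc is assumed to carry a 'family' id (e.g. 'fee_schedule')
    and a comparable 'version' field (e.g. a year or ISO date string).
    """
    latest = {}
    for doc in docs:
        fam = doc["family"]
        if fam not in latest or doc["version"] > latest[fam]["version"]:
            latest[fam] = doc
    return list(latest.values())

retrieved = [
    {"id": "fees_2023", "family": "fee_schedule", "version": 2023},
    {"id": "fees_2024", "family": "fee_schedule", "version": 2024},
    {"id": "acct_y", "family": "account_y_overview", "version": 2024},
]
current = filter_stale_versions(retrieved)
# Only fees_2024 and acct_y remain; the 2023 schedule is dropped before generation.
```

Filtering at retrieval time is cheaper than relying on the generator to prefer the newer document, which is exactly what failed in this incident.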
Implementation & Results:
- Fix 1: 2 hours (remove old documents)
- Fix 2: 4 hours (add entity linking)
- Fix 3: 3 hours (switch to semantic chunking and retest)
- Fix 4: 6 hours (add verification step)
- Total: 15 hours implementation
Results (post-deployment):
- Failure rate dropped from 11% → 1.2%
- Silent failures (wrong answers, not flagged) dropped from 8% → 0.1%
- Customer complaints fell by 80% within 1 month
Lessons Learned:
- The initial hypothesis ("better LLM") was wrong. Diagnosis revealed retrieval/data issues were primary.
- Compound failures (stale + conflation + wrong chunk) were worse than any single failure.
- Silent failures (wrong with high confidence) are the most harmful. The answer was plausible-sounding.
- Systematic diagnosis with the 6-step protocol identified root causes in 30 minutes; would have taken days with trial-and-error.
Detection Methods Reference
| Failure Type | Symptom | How to Detect | Metric to Track |
|---|---|---|---|
| 1. Missed Retrieval | Unanswerable but docs exist | Manual corpus search for relevant docs | Oracle Recall (% of oracle-relevant docs retrieved) |
| 2. Wrong Chunk | Doc relevant but excerpt wrong | Compare document-level vs. chunk-level eval | Chunk Precision vs. Document Precision |
| 3. Stale Context | Factually wrong, old source cited | Check publication dates of retrieved docs | Average Document Age; Failure rate by document age |
| 4. Hallucinated Source | Wrong or invented citation | Verify all citations exist in corpus | Citation Validity Rate |
| 5. Conflation | False claim mixing sources | Trace multi-source claims; verify relationships | Answer Consistency; Cross-document Accuracy |
| 6. Over-Retrieval Noise | Low precision, high noise | Measure context precision (relevant docs / total) | Context Precision; % Irrelevant Documents |
| 7. Under-Retrieval Gap | Incomplete answers, low recall | Measure context recall | Context Recall; Answer Completeness |
| 8. Embedding Drift | Performance declines over time | Monitor precision/recall trends monthly | Precision/Recall Trend; Embedding Space Coverage |
| 9. Query Reformulation Failure | Reformulated query loses intent | Compare original vs. reformulated retrieval quality | Delta Precision (original vs. reformulated) |
| 10. Reranker Errors | Relevant docs ranked low | Compare retriever vs. reranker ranking | Ranking Correlation; Reranker Accuracy |
| 11. Context Window Overflow | Later docs ignored | Check context token usage vs. window size | % Documents Fully Processed; Token Distribution |
| 12. Citation Fabrication | Citation wrong or made-up | Verify each citation; check claim-source match | Citation Accuracy; Hallucination Rate |
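Several of the detection methods above (Failures 4 and 12) reduce to checking every cited source against the corpus. A minimal sketch of the Citation Validity Rate metric (the `citations` field is an assumed answer format):

```python
def citation_validity_rate(answers: list, corpus_ids: set) -> float:
    """Fraction of citations that point to a real document in the corpus.

    Each answer is assumed to carry a 'citations' list of document ids.
    Returns 1.0 when there are no citations to check.
    """
    total = valid = 0
    for answer in answers:
        for cid in answer["citations"]:
            total += 1
            valid += cid in corpus_ids
    return valid / total if total else 1.0

corpus = {"fees_2024", "acct_y_overview"}
answers = [
    {"citations": ["fees_2024"]},
    {"citations": ["fees_2019", "acct_y_overview"]},  # fees_2019 does not exist
]
rate = citation_validity_rate(answers, corpus)  # 2 of 3 citations are valid
```

Note this only catches invented sources; detecting a *real* citation attached to an unsupported claim additionally requires claim-source matching, as the table's Failure 12 row indicates.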
The 12 Failure Types — Quick Reference
- 1. Missed Retrieval: Relevant doc not in results (retrieval)
- 2. Wrong Chunk: Right doc, wrong section (retrieval)
- 3. Stale Context: Outdated information (retrieval/data)
- 4. Hallucinated Source: Wrong or invented citation (generation)
- 5. Conflation: Mixing unrelated sources (generation)
- 6. Over-Retrieval Noise: Too many irrelevant docs (retrieval)
- 7. Under-Retrieval Gap: Too few docs, incomplete answer (retrieval)
- 8. Embedding Drift: Embedding quality degraded (retrieval/model)
- 9. Query Reformulation Failure: Bad query expansion (retrieval)
- 10. Reranker Errors: Reranker ranks wrong (retrieval)
- 11. Context Window Overflow: Docs cut off (generation)
- 12. Citation Fabrication: Citation wrong/made-up (generation)
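The stage labels in this quick reference can be encoded as a lookup, which helps in Step 5 of the protocol when deciding whether a failure is primarily a retrieval or generation issue. A small sketch (the mapping mirrors the list above; the majority-vote heuristic is an illustration, not a prescribed rule):

```python
# Failure id -> pipeline stage, per the quick reference above.
FAILURE_STAGE = {
    1: "retrieval", 2: "retrieval", 3: "retrieval/data",
    4: "generation", 5: "generation", 6: "retrieval",
    7: "retrieval", 8: "retrieval/model", 9: "retrieval",
    10: "retrieval", 11: "generation", 12: "generation",
}

def primary_stage(failure_ids: list) -> str:
    """Majority vote over the stages of the observed failure modes."""
    stages = [FAILURE_STAGE[i].split("/")[0] for i in failure_ids]
    return max(set(stages), key=stages.count)

print(primary_stage([3, 5, 2]))  # retrieval (two retrieval-side modes vs. one generation)
```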
Master Systematic RAG Evaluation
Learn to diagnose and fix RAG systems at scale with the CAEE Level 3 program, covering failure taxonomy, diagnosis protocols, and production incident recovery.