Introduction: Naming Failure Modes
When your RAG system produces a bad answer, it's tempting to blame it on "bad retrieval" or "bad generation." But that's too coarse. There are at least 12 distinct failure modes, each with different root causes, different detection methods, and different fixes.
Understanding the taxonomy allows you to diagnose precisely what went wrong. Precision in diagnosis leads to precision in fixes.
Retriever Failures (1-9) — Detailed Analysis
Failure 1: Missed Retrieval — Relevant Document Not Retrieved
Definition: The relevant document exists in your corpus but your retriever doesn't return it. The answer is knowable, but unretrievable.
Root Causes:
- Embedding Space Mismatch: The query embedding is far from the document embedding in vector space
- Query-Document Language Mismatch: Query uses different terminology than documents (synonyms, paraphrases)
- Semantic Distance Too Large: Query intent fundamentally different from how documents express the concept
- Ranking Threshold Too High: Document passes similarity threshold but ranked below cutoff
Real-World Example: A customer support RAG trained on technical documentation fails to retrieve answers about "activation codes" because the documentation calls them "serial number authentication tokens." The embeddings never align.
Detection Methods:
- For every failed answer, manually search your corpus for documents that would have fixed it
- If found, measure the similarity score between query and document
- Compare this "oracle retrieval score" to your actual retrieval threshold
- Analyze the embedding space distance using tools like UMAP to visualize misalignment
- Run query expansion experiments and measure oracle retrieval score improvement
Fixes to Try (in order of effectiveness):
- Query Expansion: Automatically generate synonyms, paraphrases, and related terms (use LLM to expand queries)
- Better Embeddings: Switch to domain-specific embedding model (e.g., for legal or medical)
- Fine-tune Embeddings: Finetune existing embedding model on your domain data with relevance pairs
- Synonym Dictionary: Build domain-specific synonym mapping (technical support: "activation code" → "serial token" → "authentication key")
- Multi-Retrieval Strategy: Retrieve with multiple similarity metrics (cosine, dot product, Euclidean) and ensemble results
- Hybrid Search: Combine dense (embedding) + sparse (BM25 keyword) retrieval to catch both semantic and lexical matches
Diagnostic Signals: Context recall is low. Precision might be reasonable. Average similarity score of top-K retrieved documents is lower than expected.
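One easy way to implement the hybrid-search fix is reciprocal rank fusion (RRF), which merges ranked lists from dense and sparse retrievers without having to normalize their incomparable scores. A minimal sketch; the document IDs are illustrative:

```python
# Reciprocal rank fusion (RRF): merge dense and sparse rankings without
# normalizing their incomparable scores. Document IDs here are illustrative.

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists (best-first) of doc IDs into one ranking.
    k=60 is the constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_serial_tokens", "doc_setup", "doc_billing"]          # embedding retrieval
sparse = ["doc_activation_faq", "doc_serial_tokens", "doc_setup"]  # BM25 retrieval
fused = rrf_fuse([dense, sparse])
# "doc_serial_tokens" ranks first because it appears high in both lists.
```

A document that ranks moderately well in both retrievers beats one that ranks highly in only one, which is exactly the behavior you want for the "activation code" vs. "serial token" vocabulary gap.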
Failure 2: Wrong Chunk — Right Document, Wrong Section
Definition: Your retriever finds a relevant document but returns the wrong section/chunk. The document contains the answer, but in a different part than what was retrieved.
Root Causes:
- Poor Chunking Strategy: Long documents split at arbitrary points (e.g., fixed token counts) that break meaning
- Loss of Context Between Chunks: Important context lives in previous chunk, current chunk references it ambiguously
- Embedding at Wrong Granularity: You embed at paragraph level but answer spans multiple paragraphs
- Insufficient Chunk Overlap: Moving window chunking without enough overlap loses connecting sentences
Real-World Example: A legal document retriever splits a contract at natural section breaks. Query: "What happens if I breach the warranty?" The warranty clause is in section 5. The retriever returns section 3 (general conditions) which mentions warranties but not breach consequences. The breach remedies are in section 7.
Detection Methods:
- For failed answers, manually find the document that contains the right answer
- Check if your retriever returned any chunk from that document
- If yes, compare the returned chunk to the ideal chunk
- Analyze: were they from the same document? If yes, this is wrong chunk failure
- Calculate "chunk recall within document": of the retrieved documents that contain the answer, what % returned the answer-bearing chunk?
Fixes to Try (in order of effectiveness):
- Overlapping Chunks: Use sliding window with 50-100 token overlap. This ensures context carries forward
- Semantic Chunking: Split documents at semantic boundaries (using sentence embeddings) rather than fixed sizes
- Hierarchical Chunking: Create chunks at multiple levels (section, paragraph, sentence) and retrieve at multiple granularities
- Context-Preserving Chunks: Append section headers and previous sentence to each chunk for context
- Finer-Grained Embeddings: Embed at sentence level instead of paragraph level for finer retrieval precision
Diagnostic Signals: You retrieve documents that are relevant (high precision on document-level) but chunks within those documents are wrong (low precision on chunk-level).
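The overlapping-chunks fix is only a few lines of code. A sketch over a pre-tokenized document; the chunk size and overlap values are illustrative and should be tuned for your corpus:

```python
def sliding_chunks(tokens, size=200, overlap=50):
    """Split a token list into chunks of `size` tokens, each sharing
    `overlap` tokens with the previous chunk so context carries forward."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

doc = [f"tok{i}" for i in range(500)]
chunks = sliding_chunks(doc)
# Three chunks; chunk 1 begins with the last 50 tokens of chunk 0,
# so a sentence cut at token 200 survives intact in the next chunk.
```
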
Failure 3: Stale Context — Outdated Information
Definition: The retrieved document is factually outdated. It was correct when written but facts have changed. The system confidently provides old answers.
Root Causes:
- Corpus Not Updated: Your indexed corpus is months or years old
- No Recency Bias: Ranking treats old and new documents equally
- Multi-Versioned Documents: You have both old and new versions in corpus; retriever picks old one
- Time-Sensitive Domain: Price lists, regulations, policy documents change frequently but corpus static
Real-World Example: A finance RAG system answers questions about tax policy. A new tax law passed in 2025, but the corpus still contains 2023 regulations. The system retrieves the old regulation and confidently provides advice that violates current law.
Detection Methods:
- Implement temporal tagging — every document has publication/update date
- For failed answers, check publication date of retrieved documents
- Correlate answer errors with publication date gaps
- Build a time-series eval set: same question answered on multiple dates, comparing to ground truth by date
- Track which domains have high failure rate for time-sensitive queries
Fixes to Try (in order of effectiveness):
- Update Corpus Regularly: Establish cadence for re-embedding and re-indexing (weekly for fast-moving domains)
- Recency Ranking: Weight retrieval by document freshness. Recent documents get score boost
- Version Management: Identify and remove old versions of documents automatically
- Time-Window Restriction: For time-sensitive queries, only retrieve documents published after cutoff date
- Live Data Integration: For highly time-sensitive data, skip retrieval and fetch from live systems (APIs, databases)
- Deprecation Notices: Mark old documents as deprecated; retriever deprioritizes them but keeps for historical queries
Diagnostic Signals: Errors correlate with time. Recent questions answered correctly; older questions answered with outdated facts. Retrieved documents consistently predate known fact changes.
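Recency ranking can be as simple as blending similarity with an exponential freshness decay. A sketch; the 80/20 blend and the 180-day half-life are assumptions to calibrate per domain:

```python
from datetime import date

def recency_score(similarity, published, today, half_life_days=180, weight=0.2):
    """Blend retrieval similarity with freshness. A document that is
    `half_life_days` old contributes half of the freshness weight."""
    age_days = (today - published).days
    freshness = 0.5 ** (age_days / half_life_days)
    return (1 - weight) * similarity + weight * freshness

today = date(2025, 6, 1)
new_doc = recency_score(0.80, date(2025, 5, 1), today)  # fresh, slightly less similar
old_doc = recency_score(0.82, date(2023, 6, 1), today)  # stale, slightly more similar
# new_doc > old_doc: the fresh document wins the near-tie.
```

Near-ties in similarity resolve toward fresher documents, while a strongly more similar document still wins regardless of age.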
Failure 4: Hallucinated Source — Wrong or Invented Citation
Definition: The generator cites a specific document that doesn't support the claim, or invents a source that doesn't exist in your corpus.
Root Causes:
- Pattern Matching From Training Data: LLM learned to cite like it saw in training, not actually reading retrieved context
- Overconfidence: Generator makes up plausible citations because it's confident in the answer
- Ambiguous Context: Retrieved context is loosely related; generator extrapolates and confabulates link
- Prompt Incompletely Enforces Citation: Your prompt says "cite sources" but doesn't verify or penalize hallucination
Real-World Example: Query: "What year was GPT-4 released?" Generator answers: "GPT-4 was released in March 2023 according to the OpenAI press release #47." But your corpus has no document with ID #47. The generator invented the citation.
Detection Methods:
- Citation Verification: For every answer with a citation, verify the cited document actually exists in your corpus
- Claim Matching: Extract claims from answer and search for supporting text in cited document
- Automated Citation Auditing: Use an LLM to evaluate: "Does this citation actually support the claim?" on sample
- Citation Drift Monitoring: Track % of answers with valid citations over time
- Impossible Citation Detection: Flag citations with format/ID that doesn't exist in corpus
Fixes to Try (in order of effectiveness):
- Retrieval-Guided Generation: Require generator to cite directly from retrieved context; use templates like "[Claim] (Document X, line Y)"
- Citation Verification Step: Add explicit verification step: before returning answer, verify each citation against corpus
- Constrained Generation: Force LLM to only cite from provided documents using constrained decoding
- Prompt Reinforcement: Add penalties in prompt: "If you cite a source, it must be from the provided documents. Do not hallucinate sources."
- Fine-tuning for Accuracy: Fine-tune LLM on examples where citations are verified correct
Diagnostic Signals: Citations don't match documents or contain impossible document IDs. High percentage of answers cite sources; low percentage of those citations are actually valid.
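The "impossible citation" check is cheap to automate. A sketch that assumes citations are rendered as "(Document N)"; adapt the regex to whatever citation format your prompt template enforces:

```python
import re

CITATION_RE = re.compile(r"\(Document\s+(\d+)\)")

def invalid_citations(answer, corpus_ids):
    """Return cited document IDs that do not exist in the corpus."""
    return [doc_id for doc_id in CITATION_RE.findall(answer)
            if doc_id not in corpus_ids]

answer = "GPT-4 was released in March 2023 (Document 47)."
bad = invalid_citations(answer, corpus_ids={"1", "2", "3"})
# bad == ["47"]: the generator invented the citation.
```

Run this on every answer before returning it; anything flagged goes to the verification step rather than to the user.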
Failure 5: Conflation — Incorrectly Mixing Multiple Sources
Definition: The retriever brings back two unrelated documents, and the generator incorrectly combines them into a false claim.
Root Causes:
- Entity Ambiguity: Same entity name in different documents with different properties (e.g., "Apple Inc." vs. "apple the fruit")
- Generator Scope Confusion: Generator doesn't track which claims come from which documents
- Retrieved Noise: Irrelevant documents retrieved alongside relevant ones; generator combines them
- Implicit Relationships: Generator assumes relationship between two documents that isn't explicitly stated
Real-World Example: Query: "What is the capital of Australia?" Retriever returns: (1) "Canberra is the capital of Australia" + (2) "Sydney is Australia's largest city." Generator conflates these: "The capital of Australia is Sydney Canberra" or "Sydney is the capital, which is Australia's largest city."
Detection Methods:
- For false claims in answers, trace the claim back to retrieved sources
- Check if the claim requires combining information from multiple documents
- Analyze: is the combination explicitly supported or implied/inferred?
- Compare properties across documents for same entity; flag contradictions
- Build a "conflation dataset" of false claims from real failures and audit them
Fixes to Try (in order of effectiveness):
- Entity Linking: Resolve entities to canonical IDs; system knows when "Apple Inc." != "apple"
- Source Attribution in Prompts: Explicitly require LLM to tag each claim with source document: "[Fact1 - Doc A] [Fact2 - Doc B]"
- Scope Checking: Add prompt: "Do not combine facts from different documents unless explicitly related"
- Retriever Denoising: Use reranker to remove noisy irrelevant documents before generation
- Fact Verification: Post-generation verification: for multi-source claims, verify relationship actually exists
Diagnostic Signals: False claims that are "locally true" — each component comes from a document, but the combination is false. Answers cite multiple documents for a single claim.
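If you adopt the source-attribution prompt above, flagging candidate conflations becomes a string check: any single sentence that cites more than one source deserves review. A sketch; the "[Doc X]" tag format is an assumption matching the prompt example:

```python
import re

def multi_source_sentences(answer):
    """Return sentences that cite more than one source tag:
    candidates for conflation review, not automatic rejection."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        sources = set(re.findall(r"\[Doc\s+([A-Za-z0-9]+)\]", sentence))
        if len(sources) > 1:
            flagged.append(sentence)
    return flagged

answer = ("Canberra is the capital of Australia [Doc A]. "
          "Sydney, the capital, is the largest city [Doc A] [Doc B].")
# Only the second sentence is flagged: it merges facts from two documents.
```
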
Failure 6: Over-Retrieval Noise
Definition: Your retriever returns many documents, but most are noise. The relevant information is buried under irrelevant context.
Root Causes:
- Loose Threshold: You retrieve top-K documents where K is large or similarity threshold is too low
- Broad Query Matching: Query matches many documents tangentially but none deeply
- Corpus Quality Issues: Your corpus contains duplicate or near-duplicate documents that all get retrieved
- Retrieval Strategy Too Permissive: You use OR-based logic when AND-based would be tighter
Real-World Example: Query: "How do I reset my password?" Retriever returns 50 support articles about "account management" which mention passwords in passing. Only 5 of the 50 directly address password reset. The generator is overwhelmed by noise.
Detection Methods:
- Calculate Context Precision: Of top-K retrieved documents, what % are relevant to the query?
- Measure Average Rank Position: At what position do retrieved relevant documents appear? (Lower is better)
- Audit false negatives: for wrong answers, check if relevant doc was retrieved but ranked low
- Sample random answers and manually rate retrieved context for relevance
- Compare generator output with/without top-K documents to measure noise impact
Fixes to Try (in order of effectiveness):
- Tighter Retrieval: Reduce K or raise similarity threshold; retrieve fewer documents
- Better Ranking: Add a reranker (cross-encoder) that re-ranks retrieved documents
- Query Refinement: Use LLM to refine query before retrieval (more specific → tighter results)
- Multi-Stage Retrieval: Retrieve large pool, coarse-rank, then fine-rank for top-K
- Deduplication: Remove near-duplicate documents from retrieval results
Diagnostic Signals: Context precision is low. Relevant documents are retrieved but mixed with many irrelevant ones. Generator struggles to extract signal from noise.
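Context precision, the headline metric for this failure, is a one-liner worth tracking per query. A sketch using gold relevance labels from an eval set (the document IDs are illustrative):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Precision@K: fraction of retrieved documents that are relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for d in retrieved_ids if d in relevant) / len(retrieved_ids)

# 50 articles retrieved, only 5 actually about password reset:
retrieved = [f"doc{i}" for i in range(50)]
relevant = ["doc0", "doc3", "doc7", "doc12", "doc30"]
precision = context_precision(retrieved, relevant)  # 0.1, a noisy retrieval
```
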
Failure 7: Under-Retrieval Gap
Definition: You retrieve too few documents, missing relevant information that would have completed the answer.
Root Causes:
- Conservative Retrieval: You retrieve top-K where K is small or similarity threshold is very high
- Insufficient Coverage: Single document doesn't contain full answer; you need multiple documents
- Information Scattered: Answer requires combining information from documents; you retrieve too few to synthesize
- Edge Case Sparsity: Query addresses rare scenario; fewer documents cover it
Real-World Example: Query: "What are the side effects, contraindications, and interactions of drug X?" You retrieve only the main drug information document but miss the separate pharmacovigilance report and the drug-interaction database entries. Answer is incomplete.
Detection Methods:
- Calculate Context Recall: Of all documents that contain relevant information, what % did you retrieve?
- Measure Answer Completeness: Of key facts needed to answer fully, how many are in retrieved context?
- For incomplete answers, manually search corpus: do additional documents exist that would fill gaps?
- Compare answer quality with retrieval count: does quality improve with more retrieved documents?
- Analyze query complexity: multi-hop, multi-document queries show lower completion rates
Fixes to Try (in order of effectiveness):
- Increase Retrieval Count: Retrieve more documents (increase K)
- Lower Threshold: Lower similarity threshold to include marginal documents
- Multi-Hop Retrieval: Retrieve initial results; use LLM to identify missing information; retrieve again
- Iterative Refinement: Ask generator "what additional information would help?" and retrieve based on answer
- Query Decomposition: Break complex query into sub-queries; retrieve for each independently
Diagnostic Signals: Context recall is low. Answers are incomplete or vague. Generator often says "information not provided in context."
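Query decomposition, the last fix above, can be sketched as a loop over sub-queries with deduplicated results. Here `decompose` and `retrieve` are stand-ins for an LLM call and your retriever; the stubs below illustrate the drug-information example:

```python
def decompose_and_retrieve(query, decompose, retrieve, k_per_subquery=3):
    """Retrieve for each sub-query independently, deduplicating by doc ID."""
    seen, merged = set(), []
    for sub_query in decompose(query):
        for doc_id in retrieve(sub_query)[:k_per_subquery]:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Stubs standing in for an LLM decomposer and a retriever:
subs = {"drug X": ["side effects of drug X", "contraindications of drug X",
                   "interactions of drug X"]}
index = {"side effects of drug X": ["doc_main"],
         "contraindications of drug X": ["doc_main", "doc_pharmacovigilance"],
         "interactions of drug X": ["doc_interactions"]}
docs = decompose_and_retrieve("drug X", subs.get, index.get)
# docs == ["doc_main", "doc_pharmacovigilance", "doc_interactions"]
```

Each facet of the question gets its own retrieval budget, so the pharmacovigilance report is no longer crowded out by the main drug document.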
Failure 8: Embedding Drift
Definition: Your embedding model has degraded or the training distribution has drifted. Queries and documents no longer align well in embedding space.
Root Causes:
- Domain Evolution: Your domain vocabulary evolved but embeddings trained on static data
- Corpus Expansion: New documents added with different vocabulary/style than originals
- Query Language Shift: Users asking questions in new ways not seen in training
- Catastrophic Embedding Forgetting: If you retrained or fine-tuned the embedding model without re-indexing, new query vectors misalign with the old corpus vectors
Real-World Example: Medical RAG system trained embeddings 2 years ago on academic papers. During COVID-19, users asked about vaccine efficacy, long COVID, variants — terminology the embeddings never saw. Retrieval quality degraded 40% in 6 months.
Detection Methods:
- Embedding Drift Monitoring: Track average retrieval precision/recall over time
- Embedding Quality Metrics: Compare new embeddings to old on held-out queries
- Vocabulary Coverage: Analyze new queries/documents; % of tokens not in embedding training data
- Vector Space Analysis: Use UMAP to visualize embedding space; check for clustering degradation
- Comparison Testing: Baseline old embedding model vs. new on same eval set
Fixes to Try (in order of effectiveness):
- Retrain Embeddings: Periodically retrain on current corpus + current query distribution
- Fine-Tune Embeddings: Fine-tune existing embedding model on your domain data with human relevance labels
- Upgrade Embedding Model: Switch to newer embedding model (e.g., from sentence-transformers) trained on more recent data
- Continuous Retraining: Auto-retrain embeddings monthly with new data
- Domain-Specific Embeddings: Use embeddings trained specifically for your domain (legal, medical, finance-specific models)
Diagnostic Signals: Performance degradation over time. Queries that worked 3 months ago now fail. New domains/vocabularies have lower retrieval quality than established ones.
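Drift monitoring reduces to comparing a recent window of retrieval metrics against a baseline window. A minimal sketch over weekly recall numbers; the window sizes and the 5-point drop threshold are assumptions to tune:

```python
from statistics import mean

def drift_alert(weekly_recall, baseline_weeks=8, recent_weeks=4, max_drop=0.05):
    """True when average recall over the recent window has dropped more than
    `max_drop` below the baseline window, a signal to re-evaluate embeddings."""
    if len(weekly_recall) < baseline_weeks + recent_weeks:
        return False  # not enough history yet
    baseline = mean(weekly_recall[:baseline_weeks])
    recent = mean(weekly_recall[-recent_weeks:])
    return (baseline - recent) > max_drop

history = [0.82, 0.81, 0.83, 0.80, 0.82, 0.81, 0.80, 0.82,  # baseline period
           0.74, 0.72, 0.71, 0.70]                          # recent degradation
# drift_alert(history) is True: recall fell roughly 10 points.
```
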
Failure 9: Query Reformulation Failure
Definition: Your system reformulates the user query to improve retrieval, but the reformulation makes things worse.
Root Causes:
- Reformulation Model Confusion: LLM used to reformulate misunderstands original intent
- Over-Generalization: Reformulation broadens query too much (too permissive)
- Over-Specialization: Reformulation narrows query too much (too restrictive)
- Paraphrase Failure: Attempted paraphrase introduces new query intent not in original
Real-World Example: User query: "What's the difference between machine learning and deep learning?" Reformulation: "Define machine learning. Define deep learning." Now you have two separate queries, losing the comparative intent. Results discuss each separately but don't compare.
Detection Methods:
- Log both original and reformulated queries for every request
- For failed answers, compare: did original query work better than reformulated?
- Manually audit reformulations for semantic equivalence to original
- Measure retrieval quality: original query embeddings vs. reformulated
- Test: does retrieval improve with reformulation (on an oracle eval set)?
Fixes to Try (in order of effectiveness):
- Better Reformulation Model: Use more capable LLM for reformulation (e.g., GPT-4 instead of GPT-3.5)
- Reformulation Validation: Verify reformulated query retrieves better than original before using it
- Query Expansion Over Reformulation: Instead of rewriting query, expand it (keep original + add synonyms)
- Ensemble Multiple Reformulations: Generate 3-5 reformulations; retrieve with all; ensemble results
- Disable Reformulation: Not all queries benefit; use original query for specific query types
Diagnostic Signals: Reformulated queries have lower retrieval quality than original. Reformulations change fundamental intent of question. Queries about comparisons get split into separate queries.
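Reformulation validation, the second fix above, is a guard that adopts the rewrite only if it retrieves at least as well for the original intent. `retrieve` and `relevance` are stand-ins for your retriever and a lightweight scorer (a reranker would be better than the word-overlap stub used here):

```python
def choose_query(original, reformulated, retrieve, relevance):
    """Score both retrieval sets against the ORIGINAL query's intent and
    keep the reformulation only if it does not lose ground."""
    original_docs = retrieve(original)
    reformulated_docs = retrieve(reformulated)
    if relevance(original, reformulated_docs) >= relevance(original, original_docs):
        return reformulated
    return original

# Stub scorer: count original-query words appearing in retrieved snippets.
def relevance(query, docs):
    words = set(query.lower().split())
    return sum(len(words & set(d.lower().split())) for d in docs)

retrieve = {
    "what's the difference between ML and DL?": ["ML vs DL comparison guide"],
    "define ML": ["glossary of AI terms"],
}.get
original = "what's the difference between ML and DL?"
chosen = choose_query(original, "define ML", retrieve, relevance)
# chosen == original: the rewrite lost the comparative intent.
```

The key design choice is scoring both result sets against the original query, not the reformulated one, so a rewrite that drifts in intent cannot validate itself.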
Generator Failures (10-12)
Failure 10: Reranker Errors
Definition: Your reranker (if you have one) ranks relevant documents low, causing them to be dropped before generation.
Root Causes:
- Relevance Mismatch: Reranker trained on different definition of relevance than you need
- Domain Shift: Reranker trained on general web data; you're using it on specialized domain
- Insufficient Training Data: Reranker underfits; no labeled training data from your domain
- Threshold Miscalibration: You set cutoff score too high; relevant documents dropped
Real-World Example: You use a generic reranker trained on MS-MARCO data. For medical queries, it downranks documents with "discuss with your doctor" (appropriate caveating) while upranking confident assertions. Its notion of relevance is unsafe for medical use.
Detection Methods:
- Compare retriever ranking vs. reranker ranking on eval set
- For failed answers: were relevant documents ranked higher by retriever than reranker?
- Measure reranker accuracy on domain-specific test set
- Analyze: reranker score distribution; check for bias toward/against certain document types
Fixes to Try (in order of effectiveness):
- Domain-Specific Reranker: Fine-tune reranker on your domain (requires labeled relevance data)
- Retrain Reranker: If your relevance definition changed, retrain from scratch
- Lower Threshold: Increase number of documents passed to generator
- No Reranking: For some domains, better to skip reranking and increase retrieval count
- Multi-Stage Ranking: Use multiple rerankers with different strengths; ensemble their scores
Diagnostic Signals: Retriever quality is good; generator quality worse than expected. Relevant documents retrieved but ranked low by reranker.
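The first detection method, comparing the retriever's ranking to the reranker's, can be automated per eval query. A sketch that flags cases where the reranker demoted a known-relevant document; the IDs are illustrative:

```python
def rank_of(doc_id, ranking):
    """1-based rank of doc_id in a best-first list, or None if absent."""
    return ranking.index(doc_id) + 1 if doc_id in ranking else None

def reranker_demoted(relevant_id, retriever_ranking, reranker_ranking):
    """True when the reranker pushed a known-relevant document down,
    which is the core diagnostic signal for this failure mode."""
    before = rank_of(relevant_id, retriever_ranking)
    after = rank_of(relevant_id, reranker_ranking)
    return before is not None and (after is None or after > before)

retriever = ["doc_dosage", "doc_faq", "doc_blog"]
reranker = ["doc_blog", "doc_faq", "doc_dosage"]  # relevant doc demoted to last
# reranker_demoted("doc_dosage", retriever, reranker) is True
```

Run this over your labeled eval set; a high demotion rate says the reranker, not the retriever, is the component to fix.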
Failure 11: Context Window Overflow
Definition: Your retrieved context is so large that relevant information gets cut off or pushed out of the context window, becoming unavailable to generator.
Root Causes:
- Too Many Documents Retrieved: You pass 50 documents to LLM with 4K context window
- Small Context Window: Using model with small context limit relative to your corpus size
- Long Document Format: Your chunks are very large; a few chunks fill context window
- Inefficient Context Format: You include metadata, headers, formatting that takes tokens
Real-World Example: You retrieve 20 documents (average 500 tokens each) = 10K tokens. Your LLM has a 4K context window, so after the prompt only the first few documents fit. The generator never sees the rest, even if they're more relevant.
Detection Methods:
- Monitor context token count vs. context window size
- Track what % of retrieved context actually used by generator
- For failed answers: were relevant documents in part of context that got truncated?
- Analyze token distribution: first-K documents vs. remaining documents in retrieved context
Fixes to Try (in order of effectiveness):
- Reduce Retrieval Count: Retrieve fewer documents; prioritize quality over quantity
- Upgrade Model Context: Use LLM with larger context window (e.g., Claude 200K vs. 4K)
- Compress Context: Compress retrieved documents before passing to generator (abstractive or extractive summarization)
- Better Ranking: Ensure top-ranked documents are truly most relevant; eliminate less important ones
- Hierarchical Context: Pass full documents for top-3 results; abstracts/summaries for documents 4-10
Diagnostic Signals: Context token count consistently near or exceeds context window. Generator output quality improves when fewer documents provided. Last retrieved documents never contribute to output.
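Rather than letting the model silently truncate, pack documents into an explicit token budget. A greedy sketch, assuming best-first input order and a `budget_tokens` that already leaves room for the prompt and the generation:

```python
def pack_context(ranked_docs, budget_tokens):
    """Greedily keep best-first documents that fit the token budget.
    ranked_docs: list of (doc_id, token_count) pairs, best first."""
    packed, used = [], 0
    for doc_id, n_tokens in ranked_docs:
        if used + n_tokens > budget_tokens:
            continue  # doesn't fit; a smaller later doc may still fit
        packed.append(doc_id)
        used += n_tokens
    return packed

docs = [("doc1", 500), ("doc2", 400), ("doc3", 300), ("doc4", 150)]
kept = pack_context(docs, budget_tokens=1000)
# kept == ["doc1", "doc2"]: with 900 tokens used, neither doc3 nor doc4 fits.
```

Skipping (rather than stopping at) the first oversized document lets shorter lower-ranked documents use the remaining budget.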
Failure 12: Citation Fabrication
Definition: The generator produces an answer that's correct, but cites a source that doesn't actually support the claim, or invents a source that doesn't exist in your corpus.
Root Causes:
- Pattern Matching From Training: LLM learned to cite from pre-training data, not from provided context
- Confidence Overrides Accuracy: Generator is confident enough to cite, even though it's making up the citation
- Lost Context: Generator forgets what was actually provided vs. what it inferred
- Subtle Citation Drift: Citation approximately correct but details wrong (minor hallucination)
Real-World Example: Query: "How many acres is Central Park?" Correct answer: 843 acres. Retrieved document says 843. Generator says: "843 acres, per NYC Parks Department Report 2023." But your corpus has no report with that title — generator invented it.
Detection Methods:
- For every citation, verify document actually exists in your corpus
- Verify the cited text actually supports the claim (automated or human)
- Measure citation accuracy rate over sample of answers
- Flag impossible citations (wrong format, non-existent IDs)
- Check: is the citation necessary for the claim, or just cosmetic?
Fixes to Try (in order of effectiveness):
- Retrieval-Guided Generation: Force generator to cite only from provided documents using special tokens
- Post-Generation Verification: After generating answer, verify each citation before returning
- Constrained Decoding: Use vLLM or similar to constrain citations to valid document IDs
- Fine-Tuning for Accuracy: Fine-tune LLM on examples with verified correct citations
- No-Citation Option: Allow "I don't know" answers without citation rather than hallucinated citations
Diagnostic Signals: Citations exist but don't actually support claims. Citation IDs/names don't match documents in corpus. Answer is correct but citations are incorrect/made-up.
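A crude lexical support check catches the "correct answer, fabricated citation" case before an LLM judge is needed. The stop-word list is a placeholder assumption; treat low scores as flags for human review, not automatic rejection:

```python
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "per", "and"}

def support_ratio(claim, cited_text):
    """Fraction of the claim's content words found in the cited passage."""
    words = [w.strip(".,;:").lower() for w in claim.split()]
    content = [w for w in words if w and w not in STOP_WORDS]
    if not content:
        return 1.0
    passage = cited_text.lower()
    return sum(1 for w in content if w in passage) / len(content)

passage = "Central Park covers 843 acres in Manhattan."
ok = support_ratio("Central Park is 843 acres", passage)              # 1.0
bad = support_ratio("per NYC Parks Department Report 2023", passage)  # 0.0
```

The fabricated citation scores zero because none of its distinctive words appear in the retrieved passage, even though the answer itself is correct.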
Compound Failures — When Multiple Failures Cascade
Why Compound Failures Matter: Most real-world RAG failures aren't single-mode. They're combinations. And the impact isn't additive — it's multiplicative.
Failure Combination Patterns:
Pattern A: Missed Retrieval + Hallucination
Relevant document isn't retrieved (Failure 1). Generator, lacking context, hallucinates an answer with a fabricated citation (Failure 12). Result: a wrong answer backed by a confident citation, and the user has no signal that anything is wrong.
Example: User: "What's the policy on employee remote work?" Retrieval misses the remote work policy document. Generator: "You can work remotely 2 days/week per the Employee Handbook 2025" (fabricated — no such policy exists). User believes this and works remotely; gets reprimanded.
Pattern B: Over-Retrieval Noise + Conflation
Retriever returns 30 documents, most irrelevant (Failure 6). Generator mixes unrelated documents into a false claim (Failure 5). Result: Confident, incorrect synthesis.
Example: Query: "What are the risks of AI adoption?" Retrieval returns: AI papers, risk management papers, adoption case studies. Generator conflates: "AI adoption has the same risks as traditional technology adoption (low)" + "AI has existential risks (high)" into: "AI adoption has existential risks."
Pattern C: Wrong Chunk + Stale Context
Retriever finds document but wrong chunk (Failure 2). That chunk is from 2022 (Failure 3). Result: Confidently wrong outdated information from the right document.
Example: Query: "What's the current price of API access?" Document about pricing exists (2022 version retrieved). Retrieved chunk: "API costs $10/1M tokens." Today's price is $5. User bases pricing model on outdated data.
Pattern D: Under-Retrieval Gap + Incomplete Generation
Too few documents retrieved (Failure 7). Generator can't synthesize complete answer. Result: Incomplete but confident answer that sounds complete.
Example: Medical query requires information from drug safety database + patient population study + clinical trial. You only retrieve the safety database. Answer is incomplete about efficacy in specific populations but doesn't signal that.
Detecting Compound Failures:
- For each failed answer, don't stop at finding one failure mode
- Ask: Could multiple issues have contributed?
- Trace answer backwards: Was information missing (retrieval issue)? Was it misinterpreted (generation issue)? Both?
- Build failure pattern matrix: track co-occurrence of failure types
- Test fixes: if you fix Failure 1 and answer still fails, look for Failure 2
Fixing Compound Failures:
Fix the highest-impact component first. Usually: (1) Retrieval issues first (can't generate good answer from missing context), (2) Generation issues second (assuming good retrieval).
Silent vs. Loud Failures — The Confidence Problem
Silent Failures: System produces a wrong answer with high confidence. User doesn't know it's wrong. This is the most dangerous failure type.
Loud Failures: System produces an error message or explicitly says "I don't know." User knows they can't rely on the answer.
The Asymmetry: A loud failure is better than a silent failure. Users can work around loud failures. Silent failures erode trust invisibly.
Examples:
- Silent Failure: Query: "What's the dosage for drug X?" System confidently provides wrong dosage. Patient takes overdose.
- Loud Failure: Query: "What's the dosage for drug X?" System: "I don't have reliable information about dosages. Please consult your pharmacist."
Silent Failure Causes:
- Hallucinated answers that sound plausible
- Correct answers to wrong interpretations of the question
- Outdated information presented as current
- Wrong chunk that's plausible but incorrect
Detecting Silent Failures:
- Use LLM-as-judge to evaluate answer quality, not just accuracy
- Look for confidence signals: hedging language, caveats, and "I'm not sure" indicate a loud failure (good); absolute statements indicate silent-failure risk
- Measure: what % of wrong answers were presented with high confidence?
- For high-stakes domains (medical, legal, financial), sample answers and manually verify
Converting Silent to Loud Failures:
- Confidence Scoring: Add mechanism that estimates confidence in answer; return "I'm not confident" for low-confidence answers
- Verification Step: Before returning answer, verify it against retrieved context; flag low-confidence answers
- Fallback Behavior: For low-confidence answers, provide "I'm not sure, but here's what I found" instead of confident wrong answer
- Uncertainty Quantification: Use Bayesian methods or ensemble diversity to estimate answer uncertainty
- Explicit Disclaimers: For high-risk domains, always include: "This is information only, not professional advice"
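The fallback behaviors above compose into a small gating function. The two thresholds are assumptions to calibrate against whatever confidence estimator you use:

```python
def answer_or_abstain(answer, confidence, soft=0.7, hard=0.4):
    """Convert silent failures into loud ones with two thresholds:
    hedge below `soft`, abstain entirely below `hard`."""
    if confidence >= soft:
        return answer
    if confidence >= hard:
        return "I'm not sure, but here's what I found: " + answer
    return ("I don't have reliable information to answer this. "
            "Please consult an authoritative source.")

# answer_or_abstain("843 acres", 0.9) returns the bare answer;
# at 0.5 it returns the hedged version; at 0.2 it abstains entirely.
```
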
Failure Diagnosis Protocol — Step-by-Step
When a RAG system produces a wrong answer, use this protocol to diagnose root cause:
Step 1: Verify the Failure (2 min)
- Is the answer actually wrong? (Not: is it incomplete or could be better?)
- Confirm against ground truth / oracle / domain expert
- Document: the query, the answer, why it's wrong
Step 2: Analyze Retrieval (5 min)
- What documents were retrieved?
- Are they relevant to the query? (Document-level evaluation)
- Do they contain information that would have produced a correct answer?
- If relevant doc not in top-10: Failure 1 (Missed Retrieval)
- If relevant doc in results but not in top-K passed to generator: Failures 6-7 (Over/Under-retrieval) or 10 (Reranker)
- Check document publication dates: old documents retrieved? Failure 3 (Stale Context)
Step 3: Analyze Context Quality (5 min)
- Is the relevant information in the retrieved context presented clearly?
- Is the right chunk included or just the document?
- Is context missing crucial information? Failure 2 (Wrong Chunk) or 7 (Under-Retrieval)
- Is context unclear or ambiguous? Might cause conflation (Failure 5)
- Is the context window too small relative to retrieved documents? Failure 11 (Context Window Overflow)
Step 4: Analyze Generation (5 min)
- If retrieval is good, why did generation fail?
- Does the answer match any retrieved information? Or is it hallucinated? Failures 4, 12 (Hallucination, Citation Fabrication)
- Does the answer incorrectly combine information from multiple documents? Failure 5 (Conflation)
- Are citations valid? Failure 4, 12 (Hallucinated/Fabricated Sources)
- Does the answer ignore retrieved context and rely on pre-training? Failure 12 (Citation Fabrication)
Step 5: Root Cause Classification (3 min)
- Based on Steps 2-4, identify which failure mode(s)
- Prioritize: is it primarily a retrieval issue or generation issue?
- Document the root cause with evidence
Step 6: Design Fix (10 min)
- Based on failure mode, what fixes to try (in priority order)?
- Identify what metrics would validate the fix works
- Plan experiment: would the fix have prevented this failure?
Time Budget: 30 minutes per failure diagnosis
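The six steps above can be folded into a first-pass triage function over the boolean signals collected in Steps 2-4. This returns only the first matching mode in pipeline order; compound failures still need the full protocol:

```python
def triage_failure(signals):
    """Map diagnosis signals to a first-pass failure label.
    `signals` is a dict of booleans gathered during Steps 2-4;
    missing keys are treated as False."""
    if not signals.get("relevant_doc_retrieved"):
        return "Failure 1: Missed Retrieval"
    if not signals.get("relevant_doc_in_top_k"):
        return "Failures 6/7/10: Over-/Under-Retrieval or Reranker"
    if signals.get("retrieved_doc_outdated"):
        return "Failure 3: Stale Context"
    if not signals.get("right_chunk_returned"):
        return "Failure 2: Wrong Chunk"
    if signals.get("context_truncated"):
        return "Failure 11: Context Window Overflow"
    if not signals.get("citations_valid"):
        return "Failures 4/12: Hallucinated or Fabricated Citation"
    if signals.get("claims_mix_documents"):
        return "Failure 5: Conflation"
    return "Unclassified: escalate to manual review"

label = triage_failure({"relevant_doc_retrieved": True,
                        "relevant_doc_in_top_k": True,
                        "retrieved_doc_outdated": True})
# label == "Failure 3: Stale Context"
```

Checking retrieval before generation mirrors the protocol's ordering: a generation fix cannot compensate for missing context.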
RAG Failure Rates by Industry
Research from deployed RAG systems shows typical failure rate distributions. Your rates may differ, but these provide a baseline for what's "normal."
| Industry | Typical Failure Rate | Most Common Mode | Second Most Common | Recommendations |
|---|---|---|---|---|
| Healthcare/Medical | 8-12% | Hallucination (45%) | Missed Retrieval (28%) | Strict verification; domain embeddings; citation requirements |
| Legal | 6-10% | Stale Context (38%) | Wrong Chunk (24%) | Frequent corpus updates; semantic chunking; recency bias |
| Finance/Banking | 5-8% | Conflation (35%) | Stale Context (30%) | Entity linking; daily updates; source separation |
| Customer Support | 12-18% | Missed Retrieval (40%) | Over-Retrieval Noise (25%) | Query expansion; user feedback loop; frequent reranking training |
| Technical Documentation | 4-7% | Stale Context (42%) | Wrong Chunk (22%) | Versioning; timestamp tracking; update notifications |
| E-Commerce Product Info | 10-15% | Stale Context (38%) | Missed Retrieval (32%) | Real-time catalog sync; inventory updates; query expansion |
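To use the table as a baseline check, compare your measured failure rate against the typical range for your industry. A small sketch (the numbers mirror the table above; keys are illustrative):

```python
# Typical failure-rate ranges from the table above, as (low, high) fractions.
INDUSTRY_BASELINES = {
    "healthcare": (0.08, 0.12),
    "legal": (0.06, 0.10),
    "finance": (0.05, 0.08),
    "customer_support": (0.12, 0.18),
    "tech_docs": (0.04, 0.07),
    "ecommerce": (0.10, 0.15),
}

def compare_to_baseline(industry: str, observed_rate: float) -> str:
    """Classify an observed failure rate relative to the industry's typical range."""
    low, high = INDUSTRY_BASELINES[industry]
    if observed_rate < low:
        return "below typical range"
    if observed_rate > high:
        return "above typical range -- investigate"
    return "within typical range"

print(compare_to_baseline("finance", 0.11))  # above typical range -- investigate
```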
20-Item Pre-Deployment Quality Gate
Before deploying your RAG system to production, verify these 20 items. Each should have a passing test or documented exception.
Retrieval Quality (7 items):
1. Retrieval precision ≥ 0.85 on test set (documents are relevant)
2. Retrieval recall ≥ 0.80 on test set (relevant documents are found)
3. Embedding model tested on domain data; not just off-the-shelf
4. Query expansion tested; improves recall without hurting precision
5. Chunking strategy validated; chunks are cohesive and don't break context
6. Stale document detection implemented or corpus freshness verified
7. Retrieval latency acceptable for use case (goal depends on application)
Generation Quality (6 items):
8. Answer accuracy ≥ 95% on high-risk queries (medical, legal, financial)
9. Citation accuracy ≥ 95% (if citations provided)
10. No hallucinated sources detected in sample of 100 answers
11. Conflation test passed: system doesn't incorrectly mix documents
12. Confidence signals appropriate: wrong answers flagged as uncertain
13. Length of answers reasonable (not too long/short for context window)
Robustness (4 items):
14. Edge cases tested: ambiguous queries, out-of-domain queries, adversarial inputs
15. Multi-language support (if applicable): tested for your languages
16. Empty/error handling: system degrades gracefully when retrieval fails
17. Performance under load tested: latency acceptable at 10x expected QPS
Monitoring (3 items):
18. Logging implemented: every query logged with retrieval results and generation output
19. Metrics dashboard created: track precision, recall, answer accuracy, latency
20. Incident response plan written: what to do if failure rate spikes
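The measurable items in this checklist can be enforced as an automated gate before deployment. A minimal sketch using the thresholds above (metric names are assumptions, not a standard API):

```python
# Minimum thresholds taken from the checklist items above.
GATE_THRESHOLDS = {
    "retrieval_precision": 0.85,         # item 1
    "retrieval_recall": 0.80,            # item 2
    "answer_accuracy_high_risk": 0.95,   # item 8
    "citation_accuracy": 0.95,           # item 9
}

def quality_gate(metrics: dict) -> list:
    """Return the list of failing checks; an empty list means the gate passes."""
    failures = []
    for name, minimum in GATE_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures

failing = quality_gate({
    "retrieval_precision": 0.88,
    "retrieval_recall": 0.76,   # below the 0.80 gate
    "answer_accuracy_high_risk": 0.96,
    "citation_accuracy": 0.97,
})
```

Items that are not single numbers (chunking validation, incident response plan) still need a passing test or documented exception, as stated above; the gate only covers the threshold-style checks.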
Case Study: Diagnosing a Real Production RAG Failure
The Incident:
A financial services company deployed a RAG system to help employees answer customer questions about investment products. After 2 weeks in production, they noticed:
- 11% of generated answers were factually wrong
- When wrong, answers were presented confidently (silent failures)
- Wrong answers mostly about product fees and minimum investment amounts
- Customer complaints about being given incorrect information
Initial Hypothesis: "The generation model is hallucinating. We need a better LLM."
Diagnosis Process (30-minute investigation):
Step 1: Verify Failure (done — documented 50 wrong answers)
Step 2: Analyze Retrieval
Sample 10 failed answers. For each, check what was retrieved:
- Failure case 1: Query "What's the minimum investment for Fund X?" Retrieved documents: Fund X overview (correct), competitor comparison, general investing guide. Fund X minimum was in the overview.
- Failure case 2: Query "What fees apply to Account Type Y?" Retrieved: Account Type Y overview, fee schedule (from 2023), new fee schedule (from 2024). Both retrieved.
- Failure case 3-10: Similar — relevant documents retrieved.
Finding: Retrieval is working. Relevant documents are in the context. This isn't Failure 1 (Missed Retrieval).
Step 3: Analyze Context Quality
For Failure case 2: Generated answer said "Account fees are 0.5% annually per the 2023 fee schedule." But 2024 schedule shows 0.35%. Both documents retrieved, but generator used old one. The retrieved documents say different things.
Finding: Failure 3 (Stale Context). The corpus contains both old and new versions of documents. Retriever returned both. Generator picked the wrong one.
Step 4: Analyze Generation
Looking at generation prompts: System was told "Use provided documents to answer." For Failure case 2, both fee documents provided. Generator should have said "As of 2024, fees are 0.35%" but instead cited 2023.
Check: Did the generator misread the documents, or simply pick the wrong one?
Manual inspection of 10 cases:
- 5 cases: Generator used older date (2023-2024 discrepancy)
- 3 cases: Generator misread minimum amounts (e.g., "$10,000 minimum" read as "$1,000")
- 2 cases: Generator conflated similar products (two Fund X variants mixed)
Step 5: Root Cause Classification
Primary causes (in order):
- Failure 3 (Stale Context) — 50% of failures: Multiple versions of documents in corpus; generator picks old version
- Failure 5 (Conflation) — 20% of failures: Similar products retrieved together; generator confuses them
- Failure 2 (Wrong Chunk) — 20% of failures: Correct document retrieved but wrong section (e.g., minimum for account type A vs. B)
- Failure 12 (Citation Fabrication) — 10% of failures: Generator cited non-existent fee schedule
Step 6: Design Fixes
In priority order:
- Fix 1 (Stale Context): Remove 2023 fee schedules from corpus; keep only current + archive. Mark archive versions as "historical only." Implement date-based retrieval filtering.
- Fix 2 (Conflation): Add entity linking for product IDs. When query mentions "Fund X," tag it with canonical product ID. Ensure generator knows which product is being discussed.
- Fix 3 (Wrong Chunk): Switch from paragraph-level chunking to semantic chunking. Each "Account Type Y minimum: $X" fact gets its own chunk.
- Fix 4 (Citation Accuracy): Add verification step: after generating answer, verify each number/fact against retrieved documents before returning.
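Fix 1's date-based filtering can be sketched as a post-retrieval step that keeps only the newest version of each document family. A minimal sketch under assumed metadata (the `family` and `version` fields are illustrative):

```python
def filter_stale_versions(docs: list) -> list:
    """Keep only the most recent version per document family.

    Each doc is assumed to carry a 'family' id (e.g. 'fee_schedule')
    and a comparable 'version' field (e.g. a year or ISO date string).
    """
    latest = {}
    for doc in docs:
        fam = doc["family"]
        if fam not in latest or doc["version"] > latest[fam]["version"]:
            latest[fam] = doc
    return list(latest.values())

retrieved = [
    {"id": "fees_2023", "family": "fee_schedule", "version": 2023},
    {"id": "fees_2024", "family": "fee_schedule", "version": 2024},
    {"id": "acct_y", "family": "account_y_overview", "version": 2024},
]
current = filter_stale_versions(retrieved)
# Only fees_2024 and acct_y remain; the 2023 schedule is dropped before generation.
```

Filtering at retrieval time is cheaper than relying on the generator to prefer the newer document, which is exactly what failed in this incident.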
Implementation & Results:
- Fix 1: 2 hours (remove old documents)
- Fix 2: 4 hours (add entity linking)
- Fix 3: 3 hours (switch to semantic chunking and retest)
- Fix 4: 6 hours (add verification step)
- Total: 15 hours implementation
Results (post-deployment):
- Failure rate dropped from 11% → 1.2%
- Silent failures (wrong answers, not flagged) dropped from 8% → 0.1%
- Customer complaints fell by 80% within 1 month
Lessons Learned:
- The initial hypothesis ("better LLM") was wrong. Diagnosis revealed retrieval/data issues were primary.
- Compound failures (stale + conflation + wrong chunk) were worse than any single failure.
- Silent failures (wrong with high confidence) are the most harmful. The answer was plausible-sounding.
- Systematic diagnosis with the 6-step protocol identified root causes in 30 minutes; would have taken days with trial-and-error.
Detection Methods Reference
| Failure Type | Symptom | How to Detect | Metric to Track |
|---|---|---|---|
| 1. Missed Retrieval | Unanswerable but docs exist | Manual corpus search for relevant docs | Oracle Recall (% of oracle-relevant docs retrieved) |
| 2. Wrong Chunk | Doc relevant but excerpt wrong | Compare document-level vs. chunk-level eval | Chunk Precision vs. Document Precision |
| 3. Stale Context | Factually wrong, old source cited | Check publication dates of retrieved docs | Average Document Age; Failure rate by document age |
| 4. Hallucinated Source | Wrong or invented citation | Verify all citations exist in corpus | Citation Validity Rate |
| 5. Conflation | False claim mixing sources | Trace multi-source claims; verify relationships | Answer Consistency; Cross-document Accuracy |
| 6. Over-Retrieval Noise | Low precision, high noise | Measure context precision (relevant docs / total) | Context Precision; % Irrelevant Documents |
| 7. Under-Retrieval Gap | Incomplete answers, low recall | Measure context recall | Context Recall; Answer Completeness |
| 8. Embedding Drift | Performance declines over time | Monitor precision/recall trends monthly | Precision/Recall Trend; Embedding Space Coverage |
| 9. Query Reformulation Failure | Reformulated query loses intent | Compare original vs. reformulated retrieval quality | Delta Precision (original vs. reformulated) |
| 10. Reranker Errors | Relevant docs ranked low | Compare retriever vs. reranker ranking | Ranking Correlation; Reranker Accuracy |
| 11. Context Window Overflow | Later docs ignored | Check context token usage vs. window size | % Documents Fully Processed; Token Distribution |
| 12. Citation Fabrication | Citation wrong or made-up | Verify each citation; check claim-source match | Citation Accuracy; Hallucination Rate |
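Several of the detection methods above (Failures 4 and 12) reduce to checking every cited source against the corpus. A minimal sketch of the Citation Validity Rate metric (the `citations` field is an assumed answer format):

```python
def citation_validity_rate(answers: list, corpus_ids: set) -> float:
    """Fraction of citations that point to a real document in the corpus.

    Each answer is assumed to carry a 'citations' list of document ids.
    Returns 1.0 when there are no citations to check.
    """
    total = valid = 0
    for answer in answers:
        for cid in answer["citations"]:
            total += 1
            valid += cid in corpus_ids
    return valid / total if total else 1.0

corpus = {"fees_2024", "acct_y_overview"}
answers = [
    {"citations": ["fees_2024"]},
    {"citations": ["fees_2019", "acct_y_overview"]},  # fees_2019 does not exist
]
rate = citation_validity_rate(answers, corpus)  # 2 of 3 citations are valid
```

Note this only catches invented sources; detecting a *real* citation attached to an unsupported claim additionally requires claim-source matching, as the table's Failure 12 row indicates.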
The 12 Failure Types — Quick Reference
- 1. Missed Retrieval: Relevant doc not in results (retrieval)
- 2. Wrong Chunk: Right doc, wrong section (retrieval)
- 3. Stale Context: Outdated information (retrieval/data)
- 4. Hallucinated Source: Wrong or invented citation (generation)
- 5. Conflation: Mixing unrelated sources (generation)
- 6. Over-Retrieval Noise: Too many irrelevant docs (retrieval)
- 7. Under-Retrieval Gap: Too few docs, incomplete answer (retrieval)
- 8. Embedding Drift: Embedding quality degraded (retrieval/model)
- 9. Query Reformulation Failure: Bad query expansion (retrieval)
- 10. Reranker Errors: Reranker ranks wrong (retrieval)
- 11. Context Window Overflow: Docs cut off (generation)
- 12. Citation Fabrication: Citation wrong/made-up (generation)
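The stage labels in this quick reference can be encoded as a lookup, which helps in Step 5 of the protocol when deciding whether a failure is primarily a retrieval or generation issue. A small sketch (the mapping mirrors the list above; the majority-vote heuristic is an illustration, not a prescribed rule):

```python
# Failure id -> pipeline stage, per the quick reference above.
FAILURE_STAGE = {
    1: "retrieval", 2: "retrieval", 3: "retrieval/data",
    4: "generation", 5: "generation", 6: "retrieval",
    7: "retrieval", 8: "retrieval/model", 9: "retrieval",
    10: "retrieval", 11: "generation", 12: "generation",
}

def primary_stage(failure_ids: list) -> str:
    """Majority vote over the stages of the observed failure modes."""
    stages = [FAILURE_STAGE[i].split("/")[0] for i in failure_ids]
    return max(set(stages), key=stages.count)

print(primary_stage([3, 5, 2]))  # retrieval (two retrieval-side modes vs. one generation)
```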
Master Systematic RAG Evaluation
Learn to diagnose and fix RAG systems at scale with the CAEE Level 3 program, covering failure taxonomy, diagnosis protocols, and production incident recovery.