Introduction: Why Multi-Hop Matters
Many questions cannot be answered by retrieving and synthesizing a single document. Instead, they require combining information from multiple documents in a specific sequence. Example: "What is the capital of the country where the discoverer of X-rays was born?" This requires chaining: find the discoverer → find their birthplace → find the capital of that country.
Multi-hop RAG evaluation is the discipline of assessing whether your system can handle these chained-reasoning questions. This is harder than single-hop RAG because failures can occur at multiple stages: maybe retrieval fails at step 2, or generation fails at the synthesis step.
This guide teaches you to systematically evaluate multi-hop reasoning and identify where your system breaks down.
Defining Multi-Hop Reasoning
Multi-hop reasoning is reasoning that requires more than one "hop" through the document graph. Hops are document-to-document connections:
- Hop 1: Retrieve documents mentioning the discoverer of X-rays
- Hop 2: Retrieve documents about the inventor's birthplace
- Hop 3: Retrieve documents about capitals of that country
- Answer synthesis: Combine to answer the original question
In production, multi-hop questions are common. "What are the most common side effects of drugs that treat the condition mentioned in this patient's diagnosis?" requires multi-hop reasoning.
The Multi-Hop Challenge
Multi-hop evaluation is harder than single-hop evaluation for three reasons:
1. Error Amplification
In single-hop RAG, your error rate is roughly the sum of retrieval error and generation error. In multi-hop RAG, errors compound multiplicatively: if each hop succeeds 85% of the time, a 3-hop question succeeds end-to-end only (0.85)³ ≈ 61% of the time. This is why multi-hop systems often underperform even when each individual component looks strong.
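The compounding arithmetic above can be sketched in a few lines. This assumes hop failures are independent, which is a simplification but a useful baseline:

```python
def chain_success_probability(per_hop_success: float, num_hops: int) -> float:
    """End-to-end success probability, assuming independent hop failures."""
    return per_hop_success ** num_hops

for hops in (1, 2, 3, 4):
    # 3 hops at 85% per hop comes out to about 61%
    print(f"{hops}-hop chain at 85% per hop: "
          f"{chain_success_probability(0.85, hops):.0%}")
```

In practice hop failures are often correlated (a bad first retrieval poisons every later hop), so treat this as a lower-bound intuition rather than a precise model.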
2. Intermediate Step Evaluation
With single-hop, you only evaluate the final answer. With multi-hop, you must evaluate each intermediate step. Did the retriever correctly identify the discoverer of X-rays? Did it then retrieve relevant information about that person's birthplace? These intermediate evaluations are tedious to construct.
3. Reasoning Chain Transparency
For evaluation, you need to understand what reasoning the system used. Did it correctly identify the causal chain? Or did it guess? Single-hop systems can be evaluated by answer alone. Multi-hop systems need visible reasoning chains.
Multi-Hop Specific Metrics
Chain Accuracy
Did the system identify the correct reasoning chain? Example: for "What is the capital of the country where the discoverer of X-rays was born?", the correct chain is:
[Discoverer of X-rays: Wilhelm Röntgen] → [Born: Germany] → [Capital: Berlin]
If your system produces the right answer (Berlin) via the wrong chain (e.g., it looked up Einstein instead), chain accuracy = 0 even though the final answer is correct.
Measure: does the reasoning path match the expected path? Score: 1 = correct chain, 0 = wrong or hallucinated chain.
Intermediate Hop Accuracy
For each hop in the chain, measure whether the intermediate result is correct:
- Hop 1 Accuracy: Did it correctly identify the discoverer of X-rays? (Score: 1 or 0)
- Hop 2 Accuracy: Did it correctly identify their birthplace? (Score: 1 or 0)
- Hop 3 Accuracy: Did it correctly identify the capital? (Score: 1 or 0)
This granularity shows which hops fail most frequently. If Hop 2 has 60% accuracy while Hop 1 and 3 have 90%+, you know where to focus.
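Per-hop accuracy can be computed directly from a test set once each record carries the gold chain and the system's intermediate results. A minimal sketch (field names like `gold_hops` and `system_hops` are illustrative, not a required schema):

```python
def hop_accuracies(records):
    """records: list of dicts with 'gold_hops' and 'system_hops' lists."""
    num_hops = max(len(r["gold_hops"]) for r in records)
    totals = [0] * num_hops
    correct = [0] * num_hops
    for r in records:
        for i, (gold, system) in enumerate(zip(r["gold_hops"], r["system_hops"])):
            totals[i] += 1
            correct[i] += int(gold == system)
    return [c / t if t else 0.0 for c, t in zip(correct, totals)]

records = [
    {"gold_hops": ["Röntgen", "Germany", "Berlin"],
     "system_hops": ["Röntgen", "Netherlands", "Amsterdam"]},
    {"gold_hops": ["Röntgen", "Germany", "Berlin"],
     "system_hops": ["Röntgen", "Germany", "Berlin"]},
]
print(hop_accuracies(records))  # → [1.0, 0.5, 0.5]
```

Here exact string equality stands in for hop correctness; in practice you would normalize entities or use a judge model for fuzzy matches.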
Answer Faithfulness Under Incomplete Reasoning
Sometimes the system gets the right answer despite reasoning flaws. Measure whether the final answer is faithful to what the reasoning should have produced, even if reasoning itself was flawed.
Score: 1 = answer matches what should follow from reasoning, 0 = answer doesn't match reasoning, N/A = reasoning too garbled to evaluate.
The 6-Step Evaluation Protocol
Step 1: Construct Multi-Hop Test Set
Create 30-50 multi-hop questions with 2-4 required hops. Example format:
Q: "In what year did the Nobel laureate who co-discovered the structure of DNA publish their first paper?"
Ground Truth Reasoning Chain:
- Hop 1: Nobel Prize + DNA structure → James Watson (or Francis Crick, Maurice Wilkins)
- Hop 2: First publication of James Watson → [find earliest paper]
- Hop 3: Publication year → [get year]
Answer: [the verified publication year]
Required Documents:
- Doc 1: Information about Nobel Prize laureates and DNA
- Doc 2: Information about Watson's early publications
- Doc 3: Publication dates/timeline
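One way to encode a test case in this format is a small dataclass. All field names here are illustrative, not a required schema:

```python
from dataclasses import dataclass

@dataclass
class MultiHopCase:
    question: str
    gold_chain: list   # ordered intermediate results, one per hop
    gold_answer: str
    required_docs: list  # IDs of documents that must be retrieved

case = MultiHopCase(
    question="What is the capital of the country where the "
             "discoverer of X-rays was born?",
    gold_chain=["Wilhelm Röntgen", "Germany", "Berlin"],
    gold_answer="Berlin",
    required_docs=["doc_rontgen_bio", "doc_germany_overview"],
)
print(len(case.gold_chain), "hops, answer:", case.gold_answer)
```

Keeping the gold chain and required documents on the case itself makes Steps 4-6 below mechanical to score.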
Step 2: Identify Gold Standard Reasoning Chains
For each question, manually construct the correct reasoning chain. What documents must be retrieved, and in what order? How does information connect across documents?
Have 2-3 domain experts review each chain to ensure it's canonical.
Step 3: Run System and Capture Intermediate Steps
Configure your RAG system to produce not just the final answer, but also:
- Each retrieval query generated
- Documents retrieved per hop
- Intermediate answers/reasoning before final synthesis
This is critical for understanding where the system failed.
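A minimal trace wrapper illustrates the idea: record the query, retrieved document IDs, and intermediate answer at every hop. `retrieve` and `answer_hop` below are placeholders for your own retriever and generator calls, not a real API:

```python
def run_with_trace(question, plan_hops, retrieve, answer_hop):
    """Run each hop, recording query, doc IDs, and intermediate answer."""
    trace = []
    context = question
    for hop_query_fn in plan_hops:
        query = hop_query_fn(context)          # build this hop's query
        docs = retrieve(query)                 # your retriever
        intermediate = answer_hop(query, docs) # your generator
        trace.append({"query": query,
                      "doc_ids": [d["id"] for d in docs],
                      "intermediate": intermediate})
        context = intermediate                 # feed result to next hop
    return trace

# Toy demo with stub components, just to show the trace shape:
retrieve = lambda q: [{"id": f"doc_for_{q}"}]
answer_hop = lambda q, docs: q.upper()
trace = run_with_trace("who?", [lambda c: c + "1", lambda c: c + "2"],
                       retrieve, answer_hop)
print(trace)
```

Persisting this trace per question is what makes the per-hop scoring in Step 4 possible at all.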
Step 4: Evaluate Chain Correctness
For each system output, evaluate:
- Does the reasoning chain match the gold standard? (Score: 1 = yes, 0 = no)
- Does each hop retrieve relevant documents? (Score: hop accuracy, 0-100%)
- Are intermediate conclusions correct? (Score per hop: correct, partially correct, or wrong)
Step 5: Evaluate Final Answer
Measure whether the final answer is correct, independent of reasoning chain:
- Exact match: answer matches gold standard exactly (Score: 1)
- Partial credit: answer is mostly correct but missing nuance (Score: 0.5-0.9)
- Wrong: answer contradicts gold standard (Score: 0)
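A simple scorer for this rubric might use exact match plus a token-overlap band for partial credit. Mapping overlap into the 0.5-0.9 band is an assumption of this sketch, not a standard:

```python
def score_answer(system_answer: str, gold_answer: str) -> float:
    """Exact match = 1.0; strong token overlap = 0.5-0.9; else 0.0."""
    sys_tokens = set(system_answer.lower().split())
    gold_tokens = set(gold_answer.lower().split())
    if sys_tokens == gold_tokens:
        return 1.0
    overlap = len(sys_tokens & gold_tokens) / max(len(gold_tokens), 1)
    if overlap >= 0.5:
        return round(0.5 + 0.4 * overlap, 2)  # partial credit band
    return 0.0

print(score_answer("Berlin", "Berlin"))            # exact match
print(score_answer("the capital Berlin", "Berlin"))  # partial credit
```

For nuanced answers, an LLM judge with this rubric usually beats token overlap, but the overlap version is cheap and deterministic.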
Step 6: Calculate Multi-Hop Metrics
Chain Accuracy = (Correct chains) / (Total questions)
Hop-1 Accuracy = (Correct Hop 1 results) / (Total questions)
Hop-2 Accuracy = (Correct Hop 2 results) / (Total questions)
...
Overall Answer Accuracy = (Correct final answers) / (Total questions)
Multi-Hop System Quality = (Chain Accuracy * 0.3) + (Avg Hop Accuracy * 0.4) + (Answer Accuracy * 0.3)
This weighted metric captures whether the system not only gets the right answer, but via the right reasoning.
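The rollup formulas above translate directly to code; inputs are fractions in [0, 1]:

```python
def multi_hop_quality(chain_acc: float, avg_hop_acc: float,
                      answer_acc: float) -> float:
    """Weighted Multi-Hop System Quality from Step 6."""
    return 0.3 * chain_acc + 0.4 * avg_hop_acc + 0.3 * answer_acc

# e.g. 60% correct chains, 80% average hop accuracy, 75% correct answers
print(multi_hop_quality(0.60, 0.80, 0.75))  # → 0.725
```

The 0.3/0.4/0.3 weights follow the formula above; tune them to how much your application values right-for-the-right-reason versus right-answer-at-any-cost.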
HotpotQA for Production
HotpotQA is a benchmark designed specifically for multi-hop QA evaluation. It contains about 113K Wikipedia-based questions requiring 2-hop reasoning; in the distractor setting, each question is paired with 10 candidate paragraphs (2 gold, 8 distractors).
Adapting HotpotQA for Your Domain
Rather than using Wikipedia facts, create a domain-adapted version:
- Identify 2-3 core entity types in your domain (e.g., "drug", "disease", "treatment")
- Create questions that require hopping across these entity types
- Ensure each question requires at least 2 documents to answer correctly
- Create supporting document sets
Example (Medical Domain):
Q: "What are the common side effects of the first-line treatment for the condition caused by mutations in the BRCA1 gene?"
Requires hops:
- Hop 1: Find what condition BRCA1 mutations cause
- Hop 2: Find first-line treatment for that condition
- Hop 3: Find side effects of that treatment
Supporting documents: (1) Genetics reference, (2) Treatment guidelines, (3) Drug side effects database
Evaluating on HotpotQA-Adapted Benchmark
Metrics adapted from HotpotQA:
- Answer F1: How similar is your answer to the gold answer (token-level overlap)?
- Supporting Fact Recall: Did your system identify the documents that support the answer? (HotpotQA itself scores supporting facts with EM and F1.)
- Reasoning Path Accuracy: Did it follow the correct logical path? (An extension beyond HotpotQA's official metrics.)
Healthy benchmark performance: Answer F1 >70%, Supporting Fact Recall >65%.
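Answer F1 in the HotpotQA/SQuAD style is bag-of-tokens precision and recall. A minimal version (official scripts also normalize punctuation and articles, which is omitted here):

```python
from collections import Counter

def answer_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("bleeding and nausea", "bleeding nausea"))  # → 0.8
```

F1 rewards partial matches, which matters for multi-hop answers that are often phrases rather than single entities.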
Graph-Based Reasoning Evaluation
Advanced multi-hop evaluation uses graph-based reasoning: model your documents and facts as a knowledge graph, then evaluate whether your system traverses the graph correctly.
Constructing a Knowledge Graph
For your domain, build a graph where:
- Nodes: Entities (drugs, diseases, treatments, genes, etc.)
- Edges: Relationships (treats, causes, is_a, has_side_effect, etc.)
- Node attributes: Properties of the entity
Example (Medical):
Nodes: aspirin (drug), headache (condition), bleeding (side_effect), Bayer (manufacturer)
Edges: aspirin—treats→headache, aspirin—has_side_effect→bleeding, aspirin—manufactured_by→Bayer
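The medical example above can be encoded as a list of (source, relation, target) triples; plain tuples keep the sketch dependency-free (a library like networkx would do the same job at scale):

```python
edges = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "has_side_effect", "bleeding"),
    ("aspirin", "manufactured_by", "Bayer"),
]

def neighbors(node, relation, edge_list):
    """Targets reachable from node via relation."""
    return [t for s, r, t in edge_list if s == node and r == relation]

def reverse_neighbors(node, relation, edge_list):
    """Sources pointing at node via relation (reverse traversal)."""
    return [s for s, r, t in edge_list if t == node and r == relation]

# "What drug treats headaches?" — traverse the treats edge in reverse
drug = reverse_neighbors("headache", "treats", edges)[0]
print(drug, neighbors(drug, "has_side_effect", edges))  # aspirin ['bleeding']
```

Note that answering from the condition side requires reverse traversal of the treats edge, which is exactly the kind of step gold paths must spell out.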
Graph-Based Evaluation Protocol
Step 1: Question-to-Path Mapping
For each multi-hop question, determine the gold path through the graph:
Q: "What are side effects of the drug that treats headaches?"
Gold path: [start node: headache] → [treats edge, traversed in reverse] → [aspirin] → [has_side_effect edge] → [bleeding, nausea, ...]
Step 2: Trace System's Path
From the system's generated answers and reasoning, reverse-engineer the path it traversed:
System's implicit path: [headache] → [treats] → [aspirin?] → [has_side_effect] → [final answer]
Step 3: Calculate Path Metrics
- Path Overlap: What fraction of edges in the gold path does the system traverse?
- Path Efficiency: Did the system take unnecessary detours? (Ideal path length vs. actual)
- Node Coverage: Did the system visit all necessary nodes?
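These path metrics are straightforward to compute once paths are represented as edge sequences. A sketch, with paths as lists of (source, relation, target) triples:

```python
def path_overlap(gold_path, system_path):
    """Fraction of gold edges the system traversed."""
    gold, system = set(gold_path), set(system_path)
    return len(gold & system) / len(gold) if gold else 0.0

def path_efficiency(gold_path, system_path):
    """Ideal path length / actual path length (1.0 = no detours)."""
    return len(gold_path) / len(system_path) if system_path else 0.0

gold = [("headache", "treats", "aspirin"),
        ("aspirin", "has_side_effect", "bleeding")]
system = gold + [("aspirin", "manufactured_by", "Bayer")]  # one detour
print(path_overlap(gold, system), path_efficiency(gold, system))
```

Here the system covers the full gold path (overlap 1.0) but takes one extra edge, so efficiency drops to 2/3. Node coverage is the analogous ratio over visited nodes.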
Interpreting Graph-Based Results
If your system has high answer accuracy but low path overlap, one of the following is likely:
- It's getting lucky (right answer, wrong reasoning)
- Multiple reasoning paths lead to the same answer (which is fine)
- It's using shortcuts or domain knowledge beyond what's in the documents
If your system has low path overlap and low answer accuracy, likely causes include:
- Your multi-hop reasoning is fundamentally broken
- The retriever isn't following logical chains
- The generator isn't synthesizing across documents properly
Conclusion: Multi-Hop Mastery
Multi-hop evaluation is complex because multi-hop reasoning is complex. But it's essential—many real-world questions demand this reasoning. By using the protocols in this guide (6-step, HotpotQA-adapted, graph-based), you can systematically measure and improve your system's ability to reason across multiple documents.
The key insight: evaluate intermediate steps, not just final answers. This granularity shows you exactly where the chain breaks.
