Introduction: Why Multi-Hop Matters
Many questions cannot be answered by retrieving and synthesizing a single document. Instead, they require combining information from multiple documents in a specific sequence. Example: "What is the capital of the country where the discoverer of X-rays was born?" This requires chaining: find the discoverer → find their birthplace → find the capital of that country.
Multi-hop RAG evaluation is the discipline of assessing whether your system can handle these chained-reasoning questions. This is harder than single-hop RAG because failures can occur at multiple stages: maybe retrieval fails at step 2, or generation fails at the synthesis step.
This guide teaches you to systematically evaluate multi-hop reasoning and identify where your system breaks down.
Defining Multi-Hop Reasoning
Multi-hop reasoning is reasoning that requires more than one "hop" through the document graph. Hops are document-to-document connections:
- Hop 1: Retrieve documents mentioning the discoverer of X-rays
- Hop 2: Retrieve documents about the inventor's birthplace
- Hop 3: Retrieve documents about capitals of that country
- Answer synthesis: Combine to answer the original question
In production, multi-hop questions are common. "What are the most common side effects of drugs that treat the condition mentioned in this patient's diagnosis?" requires multi-hop reasoning.
The Multi-Hop Challenge
Multi-hop evaluation is harder than single-hop evaluation for three reasons:
1. Error Amplification
In single-hop RAG, your error rate is roughly the sum of retrieval error and generation error. In multi-hop RAG, errors compound multiplicatively: if each hop succeeds 85% of the time, a 3-hop question succeeds end-to-end only (0.85)³ ≈ 61% of the time. This is why multi-hop systems often underperform even when each individual component looks strong.
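The compounding arithmetic above can be sketched in a few lines. This assumes hop failures are independent, which is a simplification but a useful baseline:

```python
def chain_success_probability(per_hop_success: float, num_hops: int) -> float:
    """End-to-end success probability, assuming independent hop failures."""
    return per_hop_success ** num_hops

for hops in (1, 2, 3, 4):
    # 3 hops at 85% per hop comes out to about 61%
    print(f"{hops}-hop chain at 85% per hop: "
          f"{chain_success_probability(0.85, hops):.0%}")
```

In practice hop failures are often correlated (a bad first retrieval poisons every later hop), so treat this as a lower-bound intuition rather than a precise model.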
2. Intermediate Step Evaluation
With single-hop, you only evaluate the final answer. With multi-hop, you must evaluate each intermediate step. Did the retriever correctly identify the discoverer of X-rays? Did it then retrieve relevant information about that person's birthplace? These intermediate evaluations are tedious to construct.
3. Reasoning Chain Transparency
For evaluation, you need to understand what reasoning the system used. Did it correctly identify the causal chain? Or did it guess? Single-hop systems can be evaluated by answer alone. Multi-hop systems need visible reasoning chains.
Multi-Hop Specific Metrics
Chain Accuracy
Did the system identify the correct reasoning chain? Example: for "What is the capital of the country where the discoverer of X-rays was born?", the correct chain is:
[Discoverer of X-rays: Wilhelm Röntgen] → [Born: Germany] → [Capital: Berlin]
If your system produces the right answer (Berlin) via the wrong chain (e.g., it looked up Einstein instead), chain accuracy = 0 even though the final answer is correct.
Measure: does the reasoning path match the expected path? Score: 1 = correct chain, 0 = wrong or hallucinated chain.
Intermediate Hop Accuracy
For each hop in the chain, measure whether the intermediate result is correct:
- Hop 1 Accuracy: Did it correctly identify the discoverer of X-rays? (Score: 1 or 0)
- Hop 2 Accuracy: Did it correctly identify their birthplace? (Score: 1 or 0)
- Hop 3 Accuracy: Did it correctly identify the capital? (Score: 1 or 0)
This granularity shows which hops fail most frequently. If Hop 2 has 60% accuracy while Hop 1 and 3 have 90%+, you know where to focus.
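Per-hop accuracy can be computed directly from a test set once each record carries the gold chain and the system's intermediate results. A minimal sketch (field names like `gold_hops` and `system_hops` are illustrative, not a required schema):

```python
def hop_accuracies(records):
    """records: list of dicts with 'gold_hops' and 'system_hops' lists."""
    num_hops = max(len(r["gold_hops"]) for r in records)
    totals = [0] * num_hops
    correct = [0] * num_hops
    for r in records:
        for i, (gold, system) in enumerate(zip(r["gold_hops"], r["system_hops"])):
            totals[i] += 1
            correct[i] += int(gold == system)
    return [c / t if t else 0.0 for c, t in zip(correct, totals)]

records = [
    {"gold_hops": ["Röntgen", "Germany", "Berlin"],
     "system_hops": ["Röntgen", "Netherlands", "Amsterdam"]},
    {"gold_hops": ["Röntgen", "Germany", "Berlin"],
     "system_hops": ["Röntgen", "Germany", "Berlin"]},
]
print(hop_accuracies(records))  # → [1.0, 0.5, 0.5]
```

Here exact string equality stands in for hop correctness; in practice you would normalize entities or use a judge model for fuzzy matches.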
Answer Faithfulness Under Incomplete Reasoning
Sometimes the system gets the right answer despite reasoning flaws. Measure whether the final answer is faithful to what the reasoning should have produced, even if reasoning itself was flawed.
Score: 1 = answer matches what should follow from reasoning, 0 = answer doesn't match reasoning, N/A = reasoning too garbled to evaluate.
The 6-Step Evaluation Protocol
Step 1: Construct Multi-Hop Test Set
Create 30-50 multi-hop questions with 2-4 required hops. Example format:
Q: "In what year did the Nobel laureate who co-discovered the structure of DNA publish their first paper?"
Ground Truth Reasoning Chain:
- Hop 1: Nobel Prize + DNA structure → James Watson (or Francis Crick, Maurice Wilkins)
- Hop 2: First publication of James Watson → [find earliest paper]
- Hop 3: Publication year → [get year]
Answer: [the verified publication year]
Required Documents:
- Doc 1: Information about Nobel Prize laureates and DNA
- Doc 2: Information about Watson's early publications
- Doc 3: Publication dates/timeline
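One way to encode a test case in this format is a small dataclass. All field names here are illustrative, not a required schema:

```python
from dataclasses import dataclass

@dataclass
class MultiHopCase:
    question: str
    gold_chain: list   # ordered intermediate results, one per hop
    gold_answer: str
    required_docs: list  # IDs of documents that must be retrieved

case = MultiHopCase(
    question="What is the capital of the country where the "
             "discoverer of X-rays was born?",
    gold_chain=["Wilhelm Röntgen", "Germany", "Berlin"],
    gold_answer="Berlin",
    required_docs=["doc_rontgen_bio", "doc_germany_overview"],
)
print(len(case.gold_chain), "hops, answer:", case.gold_answer)
```

Keeping the gold chain and required documents on the case itself makes Steps 4-6 below mechanical to score.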
Step 2: Identify Gold Standard Reasoning Chains
For each question, manually construct the correct reasoning chain. What documents must be retrieved, and in what order? How does information connect across documents?
Have 2-3 domain experts review each chain to ensure it's canonical.
Step 3: Run System and Capture Intermediate Steps
Configure your RAG system to produce not just the final answer, but also:
- Each retrieval query generated
- Documents retrieved per hop
- Intermediate answers/reasoning before final synthesis
This is critical for understanding where the system failed.
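A minimal trace wrapper illustrates the idea: record the query, retrieved document IDs, and intermediate answer at every hop. `retrieve` and `answer_hop` below are placeholders for your own retriever and generator calls, not a real API:

```python
def run_with_trace(question, plan_hops, retrieve, answer_hop):
    """Run each hop, recording query, doc IDs, and intermediate answer."""
    trace = []
    context = question
    for hop_query_fn in plan_hops:
        query = hop_query_fn(context)          # build this hop's query
        docs = retrieve(query)                 # your retriever
        intermediate = answer_hop(query, docs) # your generator
        trace.append({"query": query,
                      "doc_ids": [d["id"] for d in docs],
                      "intermediate": intermediate})
        context = intermediate                 # feed result to next hop
    return trace

# Toy demo with stub components, just to show the trace shape:
retrieve = lambda q: [{"id": f"doc_for_{q}"}]
answer_hop = lambda q, docs: q.upper()
trace = run_with_trace("who?", [lambda c: c + "1", lambda c: c + "2"],
                       retrieve, answer_hop)
print(trace)
```

Persisting this trace per question is what makes the per-hop scoring in Step 4 possible at all.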
Step 4: Evaluate Chain Correctness
For each system output, evaluate:
- Does the reasoning chain match the gold standard? (Score: 1 = yes, 0 = no)
- Does each hop retrieve relevant documents? (Score: hop accuracy, 0-100%)
- Are intermediate conclusions correct? (Score per hop: correct, partially correct, or wrong)
Step 5: Evaluate Final Answer
Measure whether the final answer is correct, independent of reasoning chain:
- Exact match: answer matches gold standard exactly (Score: 1)
- Partial credit: answer is mostly correct but missing nuance (Score: 0.5-0.9)
- Wrong: answer contradicts gold standard (Score: 0)
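A simple scorer for this rubric might use exact match plus a token-overlap band for partial credit. Mapping overlap into the 0.5-0.9 band is an assumption of this sketch, not a standard:

```python
def score_answer(system_answer: str, gold_answer: str) -> float:
    """Exact match = 1.0; strong token overlap = 0.5-0.9; else 0.0."""
    sys_tokens = set(system_answer.lower().split())
    gold_tokens = set(gold_answer.lower().split())
    if sys_tokens == gold_tokens:
        return 1.0
    overlap = len(sys_tokens & gold_tokens) / max(len(gold_tokens), 1)
    if overlap >= 0.5:
        return round(0.5 + 0.4 * overlap, 2)  # partial credit band
    return 0.0

print(score_answer("Berlin", "Berlin"))            # exact match
print(score_answer("the capital Berlin", "Berlin"))  # partial credit
```

For nuanced answers, an LLM judge with this rubric usually beats token overlap, but the overlap version is cheap and deterministic.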
Step 6: Calculate Multi-Hop Metrics
Chain Accuracy = (Correct chains) / (Total questions)
Hop-1 Accuracy = (Correct Hop 1 results) / (Total questions)
Hop-2 Accuracy = (Correct Hop 2 results) / (Total questions)
...
Overall Answer Accuracy = (Correct final answers) / (Total questions)
Multi-Hop System Quality = (Chain Accuracy * 0.3) + (Avg Hop Accuracy * 0.4) + (Answer Accuracy * 0.3)
This weighted metric captures whether the system not only gets the right answer, but via the right reasoning.
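The rollup formulas above translate directly to code; inputs are fractions in [0, 1]:

```python
def multi_hop_quality(chain_acc: float, avg_hop_acc: float,
                      answer_acc: float) -> float:
    """Weighted Multi-Hop System Quality from Step 6."""
    return 0.3 * chain_acc + 0.4 * avg_hop_acc + 0.3 * answer_acc

# e.g. 60% correct chains, 80% average hop accuracy, 75% correct answers
print(multi_hop_quality(0.60, 0.80, 0.75))  # → 0.725
```

The 0.3/0.4/0.3 weights follow the formula above; tune them to how much your application values right-for-the-right-reason versus right-answer-at-any-cost.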
HotpotQA for Production
HotpotQA is a benchmark designed specifically for multi-hop QA evaluation. It contains about 113K Wikipedia-based questions requiring 2-hop reasoning; in the distractor setting, each question is paired with 10 candidate paragraphs (2 gold, 8 distractors).
Adapting HotpotQA for Your Domain
Rather than using Wikipedia facts, create a domain-adapted version:
- Identify 2-3 core entity types in your domain (e.g., "drug", "disease", "treatment")
- Create questions that require hopping across these entity types
- Ensure each question requires at least 2 documents to answer correctly
- Create supporting document sets
Example (Medical Domain):
Q: "What are the common side effects of the first-line treatment for the condition caused by mutations in the BRCA1 gene?"
Requires hops:
- Hop 1: Find what condition BRCA1 mutations cause
- Hop 2: Find first-line treatment for that condition
- Hop 3: Find side effects of that treatment
Supporting documents: (1) Genetics reference, (2) Treatment guidelines, (3) Drug side effects database
Evaluating on HotpotQA-Adapted Benchmark
Metrics adapted from HotpotQA:
- Answer F1: How similar is your answer to the gold answer (token-level overlap)?
- Supporting Fact Recall: Did your system identify the documents that support the answer? (HotpotQA itself scores supporting facts with EM and F1.)
- Reasoning Path Accuracy: Did it follow the correct logical path? (An extension beyond HotpotQA's official metrics.)
Healthy benchmark performance: Answer F1 >70%, Supporting Fact Recall >65%.
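Answer F1 in the HotpotQA/SQuAD style is bag-of-tokens precision and recall. A minimal version (official scripts also normalize punctuation and articles, which is omitted here):

```python
from collections import Counter

def answer_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("bleeding and nausea", "bleeding nausea"))  # → 0.8
```

F1 rewards partial matches, which matters for multi-hop answers that are often phrases rather than single entities.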
Graph-Based Reasoning Evaluation
Advanced multi-hop evaluation uses graph-based reasoning: model your documents and facts as a knowledge graph, then evaluate whether your system traverses the graph correctly.
Constructing a Knowledge Graph
For your domain, build a graph where:
- Nodes: Entities (drugs, diseases, treatments, genes, etc.)
- Edges: Relationships (treats, causes, is_a, has_side_effect, etc.)
- Node attributes: Properties of the entity
Example (Medical):
Nodes: aspirin (drug), headache (condition), bleeding (side_effect), Bayer (manufacturer)
Edges: aspirin—treats→headache, aspirin—has_side_effect→bleeding, aspirin—manufactured_by→Bayer
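The medical example above can be encoded as a list of (source, relation, target) triples; plain tuples keep the sketch dependency-free (a library like networkx would do the same job at scale):

```python
edges = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "has_side_effect", "bleeding"),
    ("aspirin", "manufactured_by", "Bayer"),
]

def neighbors(node, relation, edge_list):
    """Targets reachable from node via relation."""
    return [t for s, r, t in edge_list if s == node and r == relation]

def reverse_neighbors(node, relation, edge_list):
    """Sources pointing at node via relation (reverse traversal)."""
    return [s for s, r, t in edge_list if t == node and r == relation]

# "What drug treats headaches?" — traverse the treats edge in reverse
drug = reverse_neighbors("headache", "treats", edges)[0]
print(drug, neighbors(drug, "has_side_effect", edges))  # aspirin ['bleeding']
```

Note that answering from the condition side requires reverse traversal of the treats edge, which is exactly the kind of step gold paths must spell out.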
Graph-Based Evaluation Protocol
Step 1: Question-to-Path Mapping
For each multi-hop question, determine the gold path through the graph:
Q: "What are side effects of the drug that treats headaches?"
Gold path: [start node: headache] → [treats edge, traversed in reverse] → [aspirin] → [has_side_effect edge] → [bleeding, nausea, ...]
Step 2: Trace System's Path
From the system's generated answers and reasoning, reverse-engineer the path it traversed:
System's implicit path: [headache] → [treats] → [aspirin?] → [has_side_effect] → [final answer]
Step 3: Calculate Path Metrics
- Path Overlap: What fraction of edges in the gold path does the system traverse?
- Path Efficiency: Did the system take unnecessary detours? (Ideal path length vs. actual)
- Node Coverage: Did the system visit all necessary nodes?
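These path metrics are straightforward to compute once paths are represented as edge sequences. A sketch, with paths as lists of (source, relation, target) triples:

```python
def path_overlap(gold_path, system_path):
    """Fraction of gold edges the system traversed."""
    gold, system = set(gold_path), set(system_path)
    return len(gold & system) / len(gold) if gold else 0.0

def path_efficiency(gold_path, system_path):
    """Ideal path length / actual path length (1.0 = no detours)."""
    return len(gold_path) / len(system_path) if system_path else 0.0

gold = [("headache", "treats", "aspirin"),
        ("aspirin", "has_side_effect", "bleeding")]
system = gold + [("aspirin", "manufactured_by", "Bayer")]  # one detour
print(path_overlap(gold, system), path_efficiency(gold, system))
```

Here the system covers the full gold path (overlap 1.0) but takes one extra edge, so efficiency drops to 2/3. Node coverage is the analogous ratio over visited nodes.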
Interpreting Graph-Based Results
If your system has high answer accuracy but low path overlap, one of the following is likely:
- It's getting lucky (right answer, wrong reasoning)
- Multiple reasoning paths lead to the same answer (which is fine)
- It's using shortcuts or domain knowledge beyond what's in the documents
If your system has low path overlap and low answer accuracy, likely causes include:
- Your multi-hop reasoning is fundamentally broken
- The retriever isn't following logical chains
- The generator isn't synthesizing across documents properly
Conclusion: Multi-Hop Mastery
Multi-hop evaluation is complex because multi-hop reasoning is complex. But it's essential—many real-world questions demand this reasoning. By using the protocols in this guide (6-step, HotpotQA-adapted, graph-based), you can systematically measure and improve your system's ability to reason across multiple documents.
The key insight: evaluate intermediate steps, not just final answers. This granularity shows you exactly where the chain breaks.
