Why Eval Datasets Are Your Most Important Asset
Metrics matter. Methodologies matter. But the single most important factor in evaluation quality is your dataset. A bad metric applied to good data can still produce useful insights. A good metric applied to bad data produces confident nonsense.
Garbage in, garbage out. An evaluation dataset that doesn't represent your use case, lacks diversity, misses failure modes, or contains errors will lead you to optimize for the wrong things. You'll ship a system that performs well on your eval set but fails in production.
This article covers systematic dataset construction: sourcing, sampling, annotation, quality control, and documentation.
Investing 2x in dataset quality typically produces 3–5x improvement in downstream model quality. A company that spends 40% of eval effort on dataset construction and 60% on metrics typically ships higher-quality products than a company with the 20%/80% split. The dataset is the foundation.
Dataset Requirements by Use Case
Different use cases require different dataset characteristics. A good eval dataset has:
Representativeness: The distribution of examples should match real usage. If 70% of real queries are about product recommendations and 30% are about technical support, your eval set should have roughly that distribution. If your eval set is 50/50, you'll optimize for underrepresented categories at the expense of common ones.
Difficulty Distribution: Include easy, medium, and hard examples. Easy examples are passed by nearly every system (low diagnostic value). Hard examples reveal failure modes. A good distribution: 30% easy (baseline capability), 50% medium (typical performance), 20% hard (stress testing).
Coverage of Failure Modes: What ways can the system fail? For chatbots: misunderstanding ambiguous requests, hallucinating information, failing to follow instructions. Explicitly include examples testing each failure mode. A checklist of failure modes ensures comprehensive eval coverage.
Freshness: If using production data for eval, ensure it's recent. Models degrade on distribution shifts. An eval dataset from 6 months ago may not reflect current production. Refresh eval datasets quarterly at minimum.
Diversity: Multiple perspectives, demographics, languages, domains. If your system is deployed globally, eval across languages and locales. If it serves multiple user types, eval across user segments. Lack of diversity hides systematic biases.
Sourcing Strategies
Where do evaluation examples come from?
Natural Data Collection (Production Logs)
Pros: Real distribution, actual user behavior, grounded in reality. Cons: Privacy issues (user data), bias toward current system behavior (the eval favors whatever the current model already handles well), and under-representation of hard examples (easy queries dominate production traffic).
Protocol: Sample from production logs, obtain consent, anonymize, and remove personally identifiable information (PII). For high-risk applications, get explicit permission. Expect 10–30% of production examples to be unusable due to PII or missing ground truth (no reference answer).
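A minimal sketch of this sampling-and-scrubbing step, assuming log entries are dicts with a "query" field; the regex patterns are illustrative assumptions, and a real pipeline would use a dedicated PII-detection tool rather than hand-rolled regexes.

import random
import re

# Illustrative patterns for obvious PII (assumption); a production pipeline
# should use a dedicated PII-detection service instead of hand-rolled regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    # Replace detected PII spans with placeholder tokens.
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def sample_production_logs(logs, n=500, seed=42):
    # Randomly sample log entries and scrub the query text before annotation.
    random.seed(seed)
    sampled = random.sample(logs, min(n, len(logs)))
    return [{**entry, "query": redact(entry["query"])} for entry in sampled]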
Synthetic Data Generation
Pros: Unlimited volume, can target specific failure modes, no privacy issues. Cons: May not reflect real distribution, can be biased by generation method, lacks real-world nuance.
Protocol: Use templates or rules to generate examples. For example, to generate customer service queries where the customer is angry, use a template like "I bought [product] on [date] and [problem]. This is unacceptable!" Use LLMs to generate diverse examples, and validate that generated examples are reasonable (human review of a sample).
Best practice: Use synthetic data to augment natural data, not replace it. Synthetic data is useful for stress-testing specific failure modes; natural data is essential for realistic distribution.
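A minimal sketch of template-based generation using the angry-customer template above; the slot values are invented for illustration and would come from your product catalog and a curated list of complaint types in practice.

import itertools
import random

# Placeholder slot values (assumption).
PRODUCTS = ["a wireless keyboard", "a standing desk", "noise-cancelling headphones"]
DATES = ["March 3rd", "last Tuesday", "January 12th"]
PROBLEMS = ["it arrived damaged", "it stopped working after two days", "the wrong item was shipped"]

TEMPLATE = "I bought {product} on {date} and {problem}. This is unacceptable!"

def generate_angry_queries(k=50, seed=0):
    # Fill the template with shuffled slot combinations to produce synthetic queries.
    # With three values per slot there are only 27 distinct combinations, so the
    # output is capped at that count.
    random.seed(seed)
    combos = list(itertools.product(PRODUCTS, DATES, PROBLEMS))
    random.shuffle(combos)
    return [TEMPLATE.format(product=p, date=d, problem=pr) for p, d, pr in combos[:k]]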
Expert Curation
Pros: High quality, targeted failure modes, reflects domain knowledge. Cons: Expensive, small scale, curator bias.
Protocol: Domain experts (product managers, customer service team, content creators) manually create representative examples. Have multiple experts contribute independently, then review for quality. Curator bias is real—one expert's intuition may not generalize. Combine experts to reduce bias.
Crowdsourced Data
Pros: Scalable, diverse perspectives. Cons: Quality varies, requires careful QC, can be expensive at scale.
Protocol: Recruit crowd workers (MTurk, specialist platforms), provide detailed instructions, and include quality checks. Pay fairly to attract good workers. Implement two-stage review: a rapid pass to flag obviously bad data, then a detailed pass on the remainder for quality. Expect 20–40% of crowd data to require revision or rejection.
Adapted from Public Datasets
Pros: Established benchmarks, comparison to other systems, public availability. Cons: May not match your specific use case, potential contamination (models may have seen public datasets).
Protocol: Take established datasets (SQuAD for QA, GLUE for classification) and adapt to your domain. Extend with domain-specific examples. Use public datasets as baselines; supplement with domain-specific data for production evaluation.
Sampling Methodology
If you have large amounts of data (production logs, synthetic data), how do you sample to create a manageable eval set?
Simple Random Sampling
Pros: Unbiased, simple to implement. Cons: Rare categories under-represented. Use when categories are relatively balanced.
Stratified Sampling
Ensure all categories are represented proportionally. If category distribution is 70% A, 20% B, 10% C, stratified sampling maintains this distribution in the eval set. It also guarantees representation of small categories that simple random sampling might miss.
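A minimal sketch of proportional (stratified) sampling, assuming each example is a dict with a "category" field; rounding can leave the final count a few examples off the target.

import random
from collections import defaultdict

def stratified_sample(examples, total=500, seed=0):
    # Group by category, then draw from each group in proportion to its share.
    random.seed(seed)
    by_category = defaultdict(list)
    for ex in examples:
        by_category[ex["category"]].append(ex)
    sample = []
    for items in by_category.values():
        share = len(items) / len(examples)           # proportion in the source data
        k = min(len(items), round(total * share))    # proportional allocation
        sample.extend(random.sample(items, k))
    return sample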
Adversarial Sampling
Oversample hard examples. If 10% of your data represents hard edge cases, oversample them to 30–50% of the eval set. This reveals weaknesses and provides diagnostic power. The downside: the eval no longer reflects the real distribution, so aggregate metrics overestimate failure rates. Always report stratified performance separately.
Sample Size Calculation
How big should your eval set be? Depends on variance and desired precision. For binary outcome (pass/fail):
n = (z^2 × p × (1-p)) / e^2
where:
z = z-score for the desired confidence level (1.96 for 95% confidence)
p = estimated proportion (0.5 for no prior knowledge)
e = margin of error (0.05 for ±5%)
Example: For 95% confidence, ±5% margin of error, p=0.5: n = (1.96^2 × 0.5 × 0.5) / 0.05^2 = 385 samples.
For more complex metrics with multiple categories, multiply by number of categories. For ternary outcome (good/medium/bad): 385 × 3 ≈ 1,150 samples.
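The calculation above as a small helper, assuming the standard proportion-estimation formula applies to your pass/fail metric.

import math

def sample_size(z=1.96, p=0.5, e=0.05):
    # Minimum n to estimate a proportion within +/- e at the given confidence level.
    return math.ceil((z ** 2) * p * (1 - p) / (e ** 2))

print(sample_size())          # 385 for 95% confidence, +/-5% margin
print(sample_size(e=0.03))    # 1068 for a tighter +/-3% margin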
Practical guidance:
- Small-scale eval (prototype): 50–100 examples per category
- Production eval: 300–1,000 examples total (min ~100 per category)
- Continuous monitoring: 1–5% of production volume, min 100 examples/week
Annotation Protocol Design
Annotation is the process of adding labels (correct answers, quality ratings, etc.) to examples. Protocol quality determines data quality.
The 10-Step Annotation Protocol Design Process
1. Define Annotation Task Precisely
Example bad definition: "Is this answer good?" (vague, subjective). Good definition: "Does the answer correctly answer the question asked, without hallucinating information not in the source?"
2. Create Detailed Instructions
Don't assume annotators understand the task. Write step-by-step instructions. Include: task overview, specific examples (annotated with reasoning), edge cases, when to use each label.
3. Develop Annotation Categories
Binary (correct/incorrect), ternary (good/medium/bad), or multi-dimensional (fluency 1–5, accuracy 1–5, etc.). Fewer categories = higher agreement, but less detail. More categories = more detail, but lower agreement. Sweet spot: 3–5 categories per dimension.
4. Create Example Annotations
Provide 5–10 fully annotated examples with reasoning. Annotators use these as reference. Examples should cover easy, medium, hard, and edge cases.
5. Identify and Explain Edge Cases
Ambiguous examples where annotation is difficult. Provide guidance: "If the answer is partially correct, mark as MEDIUM. If mostly correct despite minor errors, mark as GOOD."
6. Define Annotation Interface
Tool for annotation (custom tool, Google Sheets, Prodigy, MTurk). Interface should be intuitive: one example at a time, clear radio buttons/dropdowns, ability to see previous examples for reference.
7. Plan Inter-Rater Reliability Check
Have multiple annotators label the same examples. Compute Cohen's kappa (inter-rater agreement). Target: kappa > 0.7 (substantial agreement). If kappa < 0.6, the task definition is unclear—revise and retry.
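A minimal sketch of the agreement check using scikit-learn's cohen_kappa_score; the two label lists are illustrative data.

from sklearn.metrics import cohen_kappa_score

# Labels from two annotators over the same eight examples (illustrative data).
rater_a = ["good", "bad", "good", "medium", "good", "bad", "medium", "good"]
rater_b = ["good", "bad", "medium", "medium", "good", "bad", "medium", "good"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
if kappa < 0.6:
    print("Agreement too low -- revise the task definition and retry.")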
8. Conduct Rater Training
Train annotators on task, examples, and interface. Provide practice round with feedback. Only deploy annotators who achieve target agreement on practice set (e.g., 85% match with gold standard).
9. Implement Quality Monitoring
During annotation, track rater consistency. If individual rater's agreement drops below threshold, retrain or replace. Compute rolling inter-rater agreement. Flag for review if kappa < 0.7.
10. Document All Decisions
Record task definition, annotation categories, example instructions, edge case guidance, rater training materials, and inter-rater agreement metrics. This enables other teams to replicate the annotation process.
Quality Control Pipeline
After annotation, implement multi-stage QC to catch errors.
Stage 1: Automated QC (Format Checking)
Validate data format. Missing fields? Invalid labels? Wrong data types? Automated scripts catch these. Example checks:
- All required fields populated
- Labels are valid (not typos)
- Numeric fields are within expected range
- No duplicate IDs
Defect rate target: <1% of data has format issues.
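A minimal sketch of such checks, assuming records are dicts with the field names used in the metadata example later in this article; the valid-label set is an assumption.

VALID_LABELS = {"correct", "incorrect"}                      # assumed label set
REQUIRED_FIELDS = {"id", "example", "reference", "label", "source"}

def check_format(records):
    # Return (record_id, issue) pairs for missing fields, bad labels, and duplicate IDs.
    issues, seen_ids = [], set()
    for rec in records:
        rec_id = rec.get("id", "<missing id>")
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            issues.append((rec_id, f"missing fields: {sorted(missing)}"))
        if rec.get("label") not in VALID_LABELS:
            issues.append((rec_id, f"invalid label: {rec.get('label')!r}"))
        if rec_id in seen_ids:
            issues.append((rec_id, "duplicate id"))
        seen_ids.add(rec_id)
    return issues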
Stage 2: Outlier Detection
Flag unusual examples for manual review. Examples:
- Examples with extreme length (very short or very long)
- Examples with unusual distributions (all examples get same label)
- Examples with high inter-rater disagreement (e.g., per-item agreement below 0.4)
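A minimal sketch of a length-based outlier check; the 3-standard-deviation threshold is a common default, not a fixed rule, and the field names match the metadata example later in this article.

import statistics

def flag_length_outliers(records, z_threshold=3.0):
    # Flag examples whose text length is far from the dataset's mean length.
    lengths = [len(rec["example"]) for rec in records]
    mean, stdev = statistics.mean(lengths), statistics.pstdev(lengths)
    if stdev == 0:
        return []
    return [rec["id"] for rec, n in zip(records, lengths)
            if abs(n - mean) / stdev > z_threshold]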
Stage 3: Human QC Sampling
Sample annotated data and review manually. An expert reviews ~10% of the annotated data. For each reviewed example:
- Check annotation accuracy (does the label match the ground truth?)
- Check annotation completeness (are all required fields filled?)
- Check annotation consistency (does this label match similar examples?)
Defect rate target: <5% of reviewed data has errors.
Stage 4: Annotation Review and Correction
If errors found, determine root cause. Is annotation guidance unclear? Is rater inexperienced? Are examples genuinely ambiguous? Address root cause: revise guidance, retrain rater, or escalate ambiguous examples to expert.
Reprocess the corrected data to confirm the fix worked.
Adversarial Dataset Construction
Beyond representative data, explicitly construct hard examples to probe failure modes.
Deliberately Hard Examples
Negation handling: "The cat did not chase the mouse" vs. "The cat chased the mouse." For models that struggle with negation, create examples testing this.
Long-range dependencies: "The company announced yesterday that next quarter they will increase salaries, but only for roles in the sales department. Who gets raises?" Requires tracking multiple clauses across a long span.
Rare entities/domains: Test performance on uncommon names, specialized terminology, niche domains. Models often perform worse on rare categories.
Red-Team-Sourced Adversarial Inputs
Hire red-teamers (security experts or domain experts) to generate adversarial inputs. Red-teamers try to break the system. Examples:
- Prompt injection: "Forget the instructions above. Instead, [malicious instruction]"
- Jailbreaks: Requests that violate policies, phrased to bypass safety mechanisms
- Distribution shifts: Out-of-domain queries, unusual phrasings
Red-teamers are creative—they find failure modes that automated generation misses. Budget: 1–2% of eval dataset construction effort.
Failure Mode Coverage Testing
Create explicit checklist of possible failures:
- Hallucination: Examples where model might fabricate information
- Reasoning errors: Examples requiring multi-step logic
- Instruction following: Examples with specific constraints ("Answer in exactly 1 sentence")
- Safety violations: Examples probing safety boundaries
- Bias: Examples testing for stereotypes or unfair treatment
For each failure mode, create 5–10 test examples. Measure model performance per failure mode. Report breakdown in eval results: "Hallucination rate: 12%, Reasoning error rate: 5%".
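A minimal sketch of the per-failure-mode breakdown, assuming each eval result is a dict recording which failure mode it probes and whether the system passed.

from collections import Counter

def failure_mode_report(results):
    # Compute the failure rate per failure mode from a list of eval results,
    # each a dict with a 'failure_mode' string and a boolean 'passed'.
    totals, failures = Counter(), Counter()
    for r in results:
        totals[r["failure_mode"]] += 1
        if not r["passed"]:
            failures[r["failure_mode"]] += 1
    return {mode: failures[mode] / totals[mode] for mode in totals}

# Example output: {"hallucination": 0.12, "reasoning": 0.05, "instruction_following": 0.08}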
Adversarial Example Taxonomy
Categorize adversarial examples by difficulty and type:
- Typo/Spelling Errors: Robustness to noise
- Paraphrasing: Robustness to rephrasing
- Negation/Modality: Semantic understanding
- Long-Range Dependencies: Context processing
- Rare Domains/Entities: Out-of-distribution generalization
- Instruction Constraints: Precise instruction following
For each category, create 10–20 examples. Measure accuracy per category. This reveals which adversarial types the model struggles with.
Dataset Contamination Detection
Has your test set leaked into training data? If so, test performance is meaningless (the model has "cheated").
N-Gram Overlap Methods
Compute overlap between test set and training data. Extract n-grams (phrases of length n) from test examples. Check if significant n-grams appear in training data.
Protocol:
- Extract 4-grams and 5-grams from test set
- Check overlap with training data
- If >5% of test n-grams appear in the training data, investigate further
- If >30% overlap, contamination is likely
Limitation: N-gram overlap detects exact or near-exact matches, not semantic equivalence. If training data contains a paraphrased version of a test example, n-gram overlap misses it.
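A minimal sketch of the n-gram overlap check on whitespace-tokenized text; for large corpora you would precompute and store the training n-grams (for example with hashing) rather than rebuild them on every run.

def ngrams(text, n):
    # Set of word n-grams from lowercased, whitespace-tokenized text.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_examples, training_docs, n=5):
    # Fraction of test examples sharing at least one n-gram with the training corpus.
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / len(test_examples)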
Membership Inference Attacks
More sophisticated contamination detection: train a separate model on the suspected training set and measure how well it performs on the test set. High performance suggests contamination (the model has seen similar examples).
Protocol:
- Obtain candidate training data
- Fine-tune model on candidate training data
- Evaluate on test set
- Compare to performance of model not fine-tuned
- Large gap suggests the test examples were in the training data
Limitation: Requires access to the training data and is computationally expensive. Use when contamination is suspected but n-gram methods are inconclusive.
Mitigation Strategies
For public benchmarks: If using public datasets (ImageNet, SQuAD), check for leakage. Some models were trained on public datasets, contaminating results. When using public benchmarks, explicitly state the benchmark version and training data used.
For internal eval sets: Separate training and test data strictly. Different teams for training data collection and test set creation. Version control—maintain history of who added what to test set. Regular audits: compare test set to training data quarterly.
For evolving datasets: As models are iterated on, developers inevitably see test examples. Implement "test set hold-out": only the final evaluation uses the test set. All iteration uses a validation set (separate from the test set).
Dataset Versioning and Governance
Datasets evolve. Manage versions carefully.
Semantic Versioning for Datasets
Apply versioning like software: MAJOR.MINOR.PATCH.
- MAJOR: Breaking change (e.g., different task, different metrics). Bump when eval incomparable to previous version.
- MINOR: Addition of new examples, categories, or annotations without changing existing data. Backward-compatible.
- PATCH: Bug fixes (correcting annotation errors, removing corrupted entries). No new data.
Example: eval-dataset-v2.1.3 = major version 2 (a new task definition), minor version 1 (one backward-compatible addition of examples or annotations), patch version 3 (third bug fix).
Dataset Changelogs
Document what changed in each version.
eval-dataset-v2.1.0
- Added 200 new examples for rare entity category
- Improved annotation guidance for edge cases
- No changes to existing 1,000 examples
eval-dataset-v2.0.5
- Fixed 15 annotation errors (incorrect labels)
- Removed 5 examples with PII
- Re-validated inter-rater agreement (kappa=0.78)
Deprecation Policy
Old dataset versions should eventually be deprecated. Policy example:
- Current version: actively used
- N-1 version: supported (bug fixes only)
- N-2 version: deprecated (no new features, no support)
- N-3 version: archived (reference only)
This prevents team fragmentation (different teams using different versions) while allowing gradual migration.
Access Control
Not all team members should modify eval datasets. Implement access control:
- Readers: Can view, analyze, but not modify (most team members)
- Annotators: Can add annotations under supervision
- Maintainers: Can review, approve, and merge changes (small team)
- Admins: Can modify metadata, manage versions, handle access control (leadership)
Provenance Tracking
Record source of each example: production log (date, user), synthetic generation (method, parameters), expert curation (expert name), public dataset (name, source). This enables auditing and reproducing data if needed.
Example metadata:
{
  "id": "eval-2024-001",
  "example": "What is the capital of France?",
  "reference": "Paris",
  "source": "production_log",
  "source_date": "2024-01-15",
  "annotator": "[email protected]",
  "annotation_date": "2024-01-20",
  "label": "correct",
  "version": "2.1.0"
}
Dataset Documentation
Publish dataset documentation so teams can understand and use the data correctly.
Datasheet for Datasets
Standard format from Gebru et al. Include:
- Motivation: Why was this dataset created?
- Composition: What data does it contain? How many examples? What categories?
- Collection Process: How was data collected? Sourced from production, synthetic, crowdsourced?
- Preprocessing: What cleaning/preprocessing was done?
- Uses: What is the dataset designed for? What uses are NOT recommended?
- Distribution: How will the dataset be shared? What are usage terms?
- Maintenance: Who maintains the dataset? How are updates handled?
Model Card for Dataset
Similar to datasheet but shorter, focused on practical usage:
- Dataset overview (what it measures)
- Intended uses and limitations
- Data composition and statistics
- Known biases and limitations
- Version history
- How to cite
Documenting Bias and Limitations
Be explicit about what the dataset is NOT. Examples:
- "This dataset contains 90% English examples and 10% Spanish. Performance on non-English languages may not generalize."
- "This dataset was collected from production logs of our US users. Geographic biases toward US English and culture likely exist."
- "This dataset contains expert-curated examples. Distribution may not match real user queries."
Transparency about limitations prevents misuse and sets expectations correctly.
Open-Sourcing Considerations
If publishing the dataset:
- PII removal: Scrub personally identifiable information (names, addresses, account numbers)
- Legal review: Ensure no copyright or confidentiality violations
- Licensing: Choose appropriate license (CC-BY, CC-BY-SA, CC0). CC-BY requires attribution; CC0 is public domain.
- Ethical review: Could the dataset enable harmful uses? Discuss with ethics team.
- Documentation: Publish complete datasheet, limitations, and biases alongside dataset
Dataset Construction Checklist
- Sourcing: Mix of natural, synthetic, curated, and crowdsourced data. Stratified by category. Adversarial examples included.
- Annotation: Clear task definition. 10-step protocol followed. Multiple raters. Inter-rater agreement > 0.7.
- Quality Control: Automated format checking. Outlier detection. 10% human QC review. <5% defect rate.
- Contamination: N-gram overlap check. No test-training leakage. Documented source of each example.
- Versioning: MAJOR.MINOR.PATCH versioning. Changelogs maintained. Deprecation policy in place.
- Documentation: Datasheet for Datasets completed. Biases documented. Limitations clear. Model Card updated.
- Governance: Access control implemented. Provenance tracked. Maintenance plan defined.
Build Your Evaluation Dataset
Start with a small curated set (50–100 examples). Define annotation protocol. Implement QC. Measure inter-rater agreement. Expand gradually while maintaining quality. Your dataset is the foundation of all downstream evaluation—invest accordingly.