Why Eval Datasets Are Your Most Important Asset
Metrics matter. Methodologies matter. But the single most important factor in evaluation quality is your dataset. A bad metric applied to good data can still produce useful insights. A good metric applied to bad data produces confident nonsense.
Garbage in, garbage out. An evaluation dataset that doesn't represent your use case, lacks diversity, misses failure modes, or contains errors will lead you to optimize for the wrong things. You'll ship a system that performs well on your eval set but fails in production.
This article covers systematic dataset construction: sourcing, sampling, annotation, quality control, and documentation.
Investing 2x in dataset quality typically produces 3–5x improvement in downstream model quality. A company that spends 40% of eval effort on dataset construction and 60% on metrics typically ships higher-quality products than a company with the 20%/80% split. The dataset is the foundation.
Dataset Requirements by Use Case
Different use cases require different dataset characteristics. A good eval dataset has:
Representativeness: The distribution of examples should match real usage. If 70% of real queries are about product recommendations and 30% are about technical support, your eval set should have roughly that distribution. If your eval set is 50/50, you'll optimize for underrepresented categories at the expense of common ones.
Difficulty Distribution: Include easy, medium, and hard examples. Easy examples are passed by nearly every system (low diagnostic value). Hard examples reveal failure modes. A good distribution: 30% easy (baseline capability), 50% medium (typical performance), 20% hard (stress testing).
Coverage of Failure Modes: What ways can the system fail? For chatbots: misunderstanding ambiguous requests, hallucinating information, failing to follow instructions. Explicitly include examples testing each failure mode. A checklist of failure modes ensures comprehensive eval coverage.
Freshness: If using production data for eval, ensure it's recent. Models degrade on distribution shifts. An eval dataset from 6 months ago may not reflect current production. Refresh eval datasets quarterly at minimum.
Diversity: Multiple perspectives, demographics, languages, domains. If your system is deployed globally, eval across languages and locales. If it serves multiple user types, eval across user segments. Lack of diversity hides systematic biases.
Sourcing Strategies
Where do evaluation examples come from?
Natural Data Collection (Production Logs)
Pros: Real distribution, actual user behavior, grounded in reality. Cons: Privacy issues (user data), bias toward current system behavior (the eval favors whatever the current model already handles well), and under-representation of hard examples (easy queries dominate production traffic).
Protocol: Sample from production logs, obtain consent, anonymize, and remove personally identifiable information (PII). For high-risk applications, get explicit permission. Expect 10–30% of production examples to be unusable due to PII or missing ground truth (no reference answer).
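A minimal sketch of this sampling-and-scrubbing step, assuming log entries are dicts with a "query" field; the regex patterns are illustrative assumptions, and a real pipeline would use a dedicated PII-detection tool rather than hand-rolled regexes.

import random
import re

# Illustrative patterns for obvious PII (assumption); a production pipeline
# should use a dedicated PII-detection service instead of hand-rolled regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    # Replace detected PII spans with placeholder tokens.
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def sample_production_logs(logs, n=500, seed=42):
    # Randomly sample log entries and scrub the query text before annotation.
    random.seed(seed)
    sampled = random.sample(logs, min(n, len(logs)))
    return [{**entry, "query": redact(entry["query"])} for entry in sampled]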
Synthetic Data Generation
Pros: Unlimited volume, can target specific failure modes, no privacy issues. Cons: May not reflect real distribution, can be biased by generation method, lacks real-world nuance.
Protocol: Use templates or rules to generate examples. For example, to generate customer service queries where the customer is angry, use a template like "I bought [product] on [date] and [problem]. This is unacceptable!" Use LLMs to generate diverse examples, and validate that generated examples are reasonable (human review of a sample).
Best practice: Use synthetic data to augment natural data, not replace it. Synthetic data is useful for stress-testing specific failure modes; natural data is essential for realistic distribution.
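A minimal sketch of template-based generation using the angry-customer template above; the slot values are invented for illustration and would come from your product catalog and a curated list of complaint types in practice.

import itertools
import random

# Placeholder slot values (assumption).
PRODUCTS = ["a wireless keyboard", "a standing desk", "noise-cancelling headphones"]
DATES = ["March 3rd", "last Tuesday", "January 12th"]
PROBLEMS = ["it arrived damaged", "it stopped working after two days", "the wrong item was shipped"]

TEMPLATE = "I bought {product} on {date} and {problem}. This is unacceptable!"

def generate_angry_queries(k=50, seed=0):
    # Fill the template with shuffled slot combinations to produce synthetic queries.
    # With three values per slot there are only 27 distinct combinations, so the
    # output is capped at that count.
    random.seed(seed)
    combos = list(itertools.product(PRODUCTS, DATES, PROBLEMS))
    random.shuffle(combos)
    return [TEMPLATE.format(product=p, date=d, problem=pr) for p, d, pr in combos[:k]]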
Expert Curation
Pros: High quality, targeted failure modes, reflects domain knowledge. Cons: Expensive, small scale, curator bias.
Protocol: Domain experts (product managers, customer service team, content creators) manually create representative examples. Have multiple experts contribute independently, then review for quality. Curator bias is real—one expert's intuition may not generalize. Combine experts to reduce bias.
Crowdsourced Data
Pros: Scalable, diverse perspectives. Cons: Quality varies, requires careful QC, can be expensive at scale.
Protocol: Recruit crowd workers (MTurk, specialist platforms), provide detailed instructions, and include quality checks. Pay fairly to attract good workers. Implement two-stage review: a rapid pass to flag obviously bad data, then a detailed pass on the remainder for quality. Expect 20–40% of crowd data to require revision or rejection.
Adapted from Public Datasets
Pros: Established benchmarks, comparison to other systems, public availability. Cons: May not match your specific use case, potential contamination (models may have seen public datasets).
Protocol: Take established datasets (SQuAD for QA, GLUE for classification) and adapt to your domain. Extend with domain-specific examples. Use public datasets as baselines; supplement with domain-specific data for production evaluation.
Sampling Methodology
If you have large amounts of data (production logs, synthetic data), how do you sample to create a manageable eval set?
Simple Random Sampling
Pros: Unbiased, simple to implement. Cons: Rare categories under-represented. Use when categories are relatively balanced.
Stratified Sampling
Ensure all categories are represented proportionally. If category distribution is 70% A, 20% B, 10% C, stratified sampling maintains this distribution in the eval set. It also guarantees representation of small categories that simple random sampling might miss.
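A minimal sketch of proportional (stratified) sampling, assuming each example is a dict with a "category" field; rounding can leave the final count a few examples off the target.

import random
from collections import defaultdict

def stratified_sample(examples, total=500, seed=0):
    # Group by category, then draw from each group in proportion to its share.
    random.seed(seed)
    by_category = defaultdict(list)
    for ex in examples:
        by_category[ex["category"]].append(ex)
    sample = []
    for items in by_category.values():
        share = len(items) / len(examples)           # proportion in the source data
        k = min(len(items), round(total * share))    # proportional allocation
        sample.extend(random.sample(items, k))
    return sample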
Adversarial Sampling
Oversample hard examples. If 10% of your data represents hard edge cases, oversample them to 30–50% of the eval set. This reveals weaknesses and provides diagnostic power. The downside: the eval no longer reflects the real distribution, so aggregate metrics overestimate failure rates. Always report stratified performance separately.
Sample Size Calculation
How big should your eval set be? Depends on variance and desired precision. For binary outcome (pass/fail):
n = (z^2 × p × (1-p)) / e^2
where:
z = z-score for the desired confidence level (1.96 for 95% confidence)
p = estimated proportion (0.5 for no prior knowledge)
e = margin of error (0.05 for ±5%)
Example: For 95% confidence, ±5% margin of error, p=0.5: n = (1.96^2 × 0.5 × 0.5) / 0.05^2 = 385 samples.
For more complex metrics with multiple categories, multiply by number of categories. For ternary outcome (good/medium/bad): 385 × 3 ≈ 1,150 samples.
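The calculation above as a small helper, assuming the standard proportion-estimation formula applies to your pass/fail metric.

import math

def sample_size(z=1.96, p=0.5, e=0.05):
    # Minimum n to estimate a proportion within +/- e at the given confidence level.
    return math.ceil((z ** 2) * p * (1 - p) / (e ** 2))

print(sample_size())          # 385 for 95% confidence, +/-5% margin
print(sample_size(e=0.03))    # 1068 for a tighter +/-3% margin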
Practical guidance:
- Small-scale eval (prototype): 50–100 examples per category
- Production eval: 300–1,000 examples total (min ~100 per category)
- Continuous monitoring: 1–5% of production volume, min 100 examples/week
Annotation Protocol Design
Annotation is the process of adding labels (correct answers, quality ratings, etc.) to examples. Protocol quality determines data quality.
The 10-Step Annotation Protocol Design Process
1. Define Annotation Task Precisely
Example bad definition: "Is this answer good?" (vague, subjective). Good definition: "Does the answer correctly answer the question asked, without hallucinating information not in the source?"
2. Create Detailed Instructions
Don't assume annotators understand the task. Write step-by-step instructions. Include: task overview, specific examples (annotated with reasoning), edge cases, when to use each label.
3. Develop Annotation Categories
Binary (correct/incorrect), ternary (good/medium/bad), or multi-dimensional (fluency 1–5, accuracy 1–5, etc.). Fewer categories = higher agreement, but less detail. More categories = more detail, but lower agreement. Sweet spot: 3–5 categories per dimension.
4. Create Example Annotations
Provide 5–10 fully annotated examples with reasoning. Annotators use these as reference. Examples should cover easy, medium, hard, and edge cases.
5. Identify and Explain Edge Cases
Ambiguous examples where annotation is difficult. Provide guidance: "If the answer is partially correct, mark as MEDIUM. If mostly correct despite minor errors, mark as GOOD."
6. Define Annotation Interface
Tool for annotation (custom tool, Google Sheets, Prodigy, MTurk). Interface should be intuitive: one example at a time, clear radio buttons/dropdowns, ability to see previous examples for reference.
7. Plan Inter-Rater Reliability Check
Have multiple annotators label the same examples. Compute Cohen's kappa (inter-rater agreement). Target: kappa > 0.7 (substantial agreement). If kappa < 0.6, the task definition is unclear—revise and retry.
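A minimal sketch of the agreement check using scikit-learn's cohen_kappa_score; the two label lists are illustrative data.

from sklearn.metrics import cohen_kappa_score

# Labels from two annotators over the same eight examples (illustrative data).
rater_a = ["good", "bad", "good", "medium", "good", "bad", "medium", "good"]
rater_b = ["good", "bad", "medium", "medium", "good", "bad", "medium", "good"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
if kappa < 0.6:
    print("Agreement too low -- revise the task definition and retry.")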
8. Conduct Rater Training
Train annotators on task, examples, and interface. Provide practice round with feedback. Only deploy annotators who achieve target agreement on practice set (e.g., 85% match with gold standard).
9. Implement Quality Monitoring
During annotation, track rater consistency. If individual rater's agreement drops below threshold, retrain or replace. Compute rolling inter-rater agreement. Flag for review if kappa < 0.7.
10. Document All Decisions
Record task definition, annotation categories, example instructions, edge case guidance, rater training materials, and inter-rater agreement metrics. This enables other teams to replicate the annotation process.
Quality Control Pipeline
After annotation, implement multi-stage QC to catch errors.
Stage 1: Automated QC (Format Checking)
Validate data format. Missing fields? Invalid labels? Wrong data types? Automated scripts catch these. Example checks:
- All required fields populated
- Labels are valid (not typos)
- Numeric fields are within expected range
- No duplicate IDs
Defect rate target: <1% of data has format issues.
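A minimal sketch of such checks, assuming records are dicts with the field names used in the metadata example later in this article; the valid-label set is an assumption.

VALID_LABELS = {"correct", "incorrect"}                      # assumed label set
REQUIRED_FIELDS = {"id", "example", "reference", "label", "source"}

def check_format(records):
    # Return (record_id, issue) pairs for missing fields, bad labels, and duplicate IDs.
    issues, seen_ids = [], set()
    for rec in records:
        rec_id = rec.get("id", "<missing id>")
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            issues.append((rec_id, f"missing fields: {sorted(missing)}"))
        if rec.get("label") not in VALID_LABELS:
            issues.append((rec_id, f"invalid label: {rec.get('label')!r}"))
        if rec_id in seen_ids:
            issues.append((rec_id, "duplicate id"))
        seen_ids.add(rec_id)
    return issues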
Stage 2: Outlier Detection
Flag unusual examples for manual review. Examples:
- Examples with extreme length (very short or very long)
- Examples with unusual distributions (all examples get same label)
- Examples with high inter-rater disagreement (e.g., per-item agreement below 0.4)
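A minimal sketch of a length-based outlier check; the 3-standard-deviation threshold is a common default, not a fixed rule, and the field names match the metadata example later in this article.

import statistics

def flag_length_outliers(records, z_threshold=3.0):
    # Flag examples whose text length is far from the dataset's mean length.
    lengths = [len(rec["example"]) for rec in records]
    mean, stdev = statistics.mean(lengths), statistics.pstdev(lengths)
    if stdev == 0:
        return []
    return [rec["id"] for rec, n in zip(records, lengths)
            if abs(n - mean) / stdev > z_threshold]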
Stage 3: Human QC Sampling
Sample annotated data and review manually. An expert reviews ~10% of the annotated data. For each reviewed example:
- Check annotation accuracy (does the label match the ground truth?)
- Check annotation completeness (are all required fields filled?)
- Check annotation consistency (does this label match similar examples?)
Defect rate target: <5% of reviewed data has errors.
Stage 4: Annotation Review and Correction
If errors found, determine root cause. Is annotation guidance unclear? Is rater inexperienced? Are examples genuinely ambiguous? Address root cause: revise guidance, retrain rater, or escalate ambiguous examples to expert.
Reprocess the corrected data to confirm the fix worked.
Adversarial Dataset Construction
Beyond representative data, explicitly construct hard examples to probe failure modes.
Deliberately Hard Examples
Negation handling: "The cat did not chase the mouse" vs. "The cat chased the mouse." For models that struggle with negation, create examples testing this.
Long-range dependencies: "The company announced yesterday that next quarter they will increase salaries, but only for roles in the sales department. Who gets raises?" Requires tracking multiple clauses across a long span.
Rare entities/domains: Test performance on uncommon names, specialized terminology, niche domains. Models often perform worse on rare categories.
Red-Team-Sourced Adversarial Inputs
Hire red-teamers (security experts or domain experts) to generate adversarial inputs. Red-teamers try to break the system. Examples:
- Prompt injection: "Forget the instructions above. Instead, [malicious instruction]"
- Jailbreaks: Requests that violate policies, phrased to bypass safety mechanisms
- Distribution shifts: Out-of-domain queries, unusual phrasings
Red-teamers are creative—they find failure modes that automated generation misses. Budget: 1–2% of eval dataset construction effort.
Failure Mode Coverage Testing
Create explicit checklist of possible failures:
- Hallucination: Examples where model might fabricate information
- Reasoning errors: Examples requiring multi-step logic
- Instruction following: Examples with specific constraints ("Answer in exactly 1 sentence")
- Safety violations: Examples probing safety boundaries
- Bias: Examples testing for stereotypes or unfair treatment
For each failure mode, create 5–10 test examples. Measure model performance per failure mode. Report breakdown in eval results: "Hallucination rate: 12%, Reasoning error rate: 5%".
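A minimal sketch of the per-failure-mode breakdown, assuming each eval result is a dict recording which failure mode it probes and whether the system passed.

from collections import Counter

def failure_mode_report(results):
    # Compute the failure rate per failure mode from a list of eval results,
    # each a dict with a 'failure_mode' string and a boolean 'passed'.
    totals, failures = Counter(), Counter()
    for r in results:
        totals[r["failure_mode"]] += 1
        if not r["passed"]:
            failures[r["failure_mode"]] += 1
    return {mode: failures[mode] / totals[mode] for mode in totals}

# Example output: {"hallucination": 0.12, "reasoning": 0.05, "instruction_following": 0.08}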
Adversarial Example Taxonomy
Categorize adversarial examples by difficulty and type:
- Typo/Spelling Errors: Robustness to noise
- Paraphrasing: Robustness to rephrasing
- Negation/Modality: Semantic understanding
- Long-Range Dependencies: Context processing
- Rare Domains/Entities: Out-of-distribution generalization
- Instruction Constraints: Precise instruction following
For each category, create 10–20 examples. Measure accuracy per category. This reveals which adversarial types the model struggles with.
Dataset Contamination Detection
Has your test set leaked into training data? If so, test performance is meaningless (the model has "cheated").
N-Gram Overlap Methods
Compute overlap between test set and training data. Extract n-grams (phrases of length n) from test examples. Check if significant n-grams appear in training data.
Protocol:
- Extract 4-grams and 5-grams from test set
- Check overlap with training data
- If >5% of test n-grams appear in the training data, investigate further
- If >30% overlap, contamination is likely
Limitation: N-gram overlap detects exact or near-exact matches, not semantic equivalence. If training data contains a paraphrased version of a test example, n-gram overlap misses it.
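A minimal sketch of the n-gram overlap check on whitespace-tokenized text; for large corpora you would precompute and store the training n-grams (for example with hashing) rather than rebuild them on every run.

def ngrams(text, n):
    # Set of word n-grams from lowercased, whitespace-tokenized text.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_examples, training_docs, n=5):
    # Fraction of test examples sharing at least one n-gram with the training corpus.
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / len(test_examples)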
Membership Inference Attacks
More sophisticated contamination detection: train a separate model on the suspected training set and measure how well it performs on the test set. High performance suggests contamination (the model has seen similar examples).
Protocol:
- Obtain candidate training data
- Fine-tune model on candidate training data
- Evaluate on test set
- Compare to performance of model not fine-tuned
- Large gap suggests the test examples were in the training data
Limitation: Requires access to the training data and is computationally expensive. Use when contamination is suspected but n-gram methods are inconclusive.
Mitigation Strategies
For public benchmarks: If using public datasets (ImageNet, SQuAD), check for leakage. Some models were trained on public datasets, contaminating results. When using public benchmarks, explicitly state the benchmark version and training data used.
For internal eval sets: Separate training and test data strictly. Different teams for training data collection and test set creation. Version control—maintain history of who added what to test set. Regular audits: compare test set to training data quarterly.
For evolving datasets: As models are iterated on, developers inevitably see test examples. Implement "test set hold-out": only the final evaluation uses the test set. All iteration uses a validation set (separate from the test set).
Dataset Versioning and Governance
Datasets evolve. Manage versions carefully.
Semantic Versioning for Datasets
Apply versioning like software: MAJOR.MINOR.PATCH.
- MAJOR: Breaking change (e.g., different task, different metrics). Bump when eval incomparable to previous version.
- MINOR: Addition of new examples, categories, or annotations without changing existing data. Backward-compatible.
- PATCH: Bug fixes (correcting annotation errors, removing corrupted entries). No new data.
Example: eval-dataset-v2.1.3 = major version 2 (a new task definition), minor version 1 (one backward-compatible addition of examples or annotations), patch version 3 (third bug fix).
Dataset Changelogs
Document what changed in each version.
eval-dataset-v2.1.0
- Added 200 new examples for rare entity category
- Improved annotation guidance for edge cases
- No changes to existing 1,000 examples
eval-dataset-v2.0.5
- Fixed 15 annotation errors (incorrect labels)
- Removed 5 examples with PII
- Re-validated inter-rater agreement (kappa=0.78)
Deprecation Policy
Old dataset versions should eventually be deprecated. Policy example:
- Current version: actively used
- N-1 version: supported (bug fixes only)
- N-2 version: deprecated (no new features, no support)
- N-3 version: archived (reference only)
This prevents team fragmentation (different teams using different versions) while allowing gradual migration.
Access Control
Not all team members should modify eval datasets. Implement access control:
- Readers: Can view, analyze, but not modify (most team members)
- Annotators: Can add annotations under supervision
- Maintainers: Can review, approve, and merge changes (small team)
- Admins: Can modify metadata, manage versions, handle access control (leadership)
Provenance Tracking
Record source of each example: production log (date, user), synthetic generation (method, parameters), expert curation (expert name), public dataset (name, source). This enables auditing and reproducing data if needed.
Example metadata:
{
  "id": "eval-2024-001",
  "example": "What is the capital of France?",
  "reference": "Paris",
  "source": "production_log",
  "source_date": "2024-01-15",
  "annotator": "[email protected]",
  "annotation_date": "2024-01-20",
  "label": "correct",
  "version": "2.1.0"
}
Dataset Documentation
Publish dataset documentation so teams can understand and use the data correctly.
Datasheet for Datasets
Standard format from Gebru et al. Include:
- Motivation: Why was this dataset created?
- Composition: What data does it contain? How many examples? What categories?
- Collection Process: How was data collected? Sourced from production, synthetic, crowdsourced?
- Preprocessing: What cleaning/preprocessing was done?
- Uses: What is the dataset designed for? What uses are NOT recommended?
- Distribution: How will the dataset be shared? What are usage terms?
- Maintenance: Who maintains the dataset? How are updates handled?
Model Card for Dataset
Similar to datasheet but shorter, focused on practical usage:
- Dataset overview (what it measures)
- Intended uses and limitations
- Data composition and statistics
- Known biases and limitations
- Version history
- How to cite
Documenting Bias and Limitations
Be explicit about what the dataset is NOT. Examples:
- "This dataset contains 90% English examples and 10% Spanish. Performance on non-English languages may not generalize."
- "This dataset was collected from production logs of our US users. Geographic biases toward US English and culture likely exist."
- "This dataset contains expert-curated examples. Distribution may not match real user queries."
Transparency about limitations prevents misuse and sets expectations correctly.
Open-Sourcing Considerations
If publishing the dataset:
- PII removal: Scrub personally identifiable information (names, addresses, account numbers)
- Legal review: Ensure no copyright or confidentiality violations
- Licensing: Choose appropriate license (CC-BY, CC-BY-SA, CC0). CC-BY requires attribution; CC0 is public domain.
- Ethical review: Could the dataset enable harmful uses? Discuss with ethics team.
- Documentation: Publish complete datasheet, limitations, and biases alongside dataset
Dataset Construction Checklist
- Sourcing: Mix of natural, synthetic, curated, and crowdsourced data. Stratified by category. Adversarial examples included.
- Annotation: Clear task definition. 10-step protocol followed. Multiple raters. Inter-rater agreement > 0.7.
- Quality Control: Automated format checking. Outlier detection. 10% human QC review. <5% defect rate.
- Contamination: N-gram overlap check. No test-training leakage. Documented source of each example.
- Versioning: MAJOR.MINOR.PATCH versioning. Changelogs maintained. Deprecation policy in place.
- Documentation: Datasheet for Datasets completed. Biases documented. Limitations clear. Model Card updated.
- Governance: Access control implemented. Provenance tracked. Maintenance plan defined.
Build Your Evaluation Dataset
Start with a small curated set (50–100 examples). Define annotation protocol. Implement QC. Measure inter-rater agreement. Expand gradually while maintaining quality. Your dataset is the foundation of all downstream evaluation—invest accordingly.