What Is Item Analysis and Why It Matters

Item analysis is the science of measuring how well individual test items (questions, problems, tasks) function. It answers three key questions:

  1. How difficult is this item? (p-value)
  2. Does this item discriminate between high and low performers? (discrimination index)
  3. Does this item function differently for different demographic groups? (DIF analysis)

Why does this matter? Because items that don't discriminate or that are biased add noise, not signal. An exam with 20 items, 5 of which are terrible, is really only a 15-item exam. You're wasting resources assessing those bad items and making decisions based on noisier data.

Real-world impact: A certification exam that uses poorly analyzed items might pass candidates who can't actually do the job, or fail candidates who can. This damages the credibility of the certification.

Item analysis is the quality control process for evaluation instruments.

Item Difficulty (p-value)

Definition and Calculation

Item difficulty, denoted p, is the proportion of test-takers who answer the item correctly (or score above a threshold on it).

Formula: p = (number of correct responses) / (total number of responses)

Example: 87 out of 100 candidates answered question #5 correctly. p = 0.87.
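
The calculation is a one-liner in code. A minimal sketch in Python (the function name and the 0/1 input format are illustrative, not from any particular library):

```python
def item_difficulty(responses):
    """p-value: proportion of correct responses.
    responses: list of 1 (correct) / 0 (incorrect), one entry per test-taker."""
    return sum(responses) / len(responses)

# 87 of 100 candidates answered question #5 correctly -> p = 0.87
p = item_difficulty([1] * 87 + [0] * 13)
```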

Interpreting p-values

p = 1.0 (100%) - Everyone got it right. The item is too easy. It provides no discrimination.

p = 0.7-0.9 (70-90%) - Easy item. Good for building confidence in test-takers. Useful for certification exams where you want some easy items to separate true non-performers from true performers.

p = 0.3-0.7 (30-70%) - Optimal difficulty range. Items in this range discriminate well. For general assessments, target p = 0.5.

p = 0.1-0.3 (10-30%) - Hard item. Good for identifying top performers. But can be demoralizing for candidates if too many items are this hard.

p = 0.0 (0%) - No one got it right. Either the item is impossibly hard, or there's an error (answer key wrong, question ambiguous).

Optimal Difficulty Distribution

A good exam has a mix of difficulties:

  • 10% very easy (p > 0.9)
  • 25% easy (p = 0.7-0.9)
  • 40% medium (p = 0.4-0.7)
  • 20% hard (p = 0.2-0.4)
  • 5% very hard (p < 0.2)

This distribution provides good discrimination without being demoralizing.
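
One way to audit an exam against this mix is to bin each item's p-value and compare the shares to the targets. A sketch (band boundaries follow the list above; how you resolve the overlapping edges, e.g. p exactly 0.7, is a judgment call):

```python
def difficulty_band(p):
    """Classify an item's p-value into the target difficulty bands."""
    if p > 0.9:
        return "very easy"
    if p >= 0.7:
        return "easy"
    if p >= 0.4:
        return "medium"
    if p >= 0.2:
        return "hard"
    return "very hard"

def band_shares(p_values):
    """Fraction of items in each band, for comparison to the target mix."""
    n = len(p_values)
    shares = {}
    for p in p_values:
        band = difficulty_band(p)
        shares[band] = shares.get(band, 0) + 1 / n
    return shares
```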

Item Discrimination and Point-Biserial Correlation

Why Discrimination Matters

An item that everyone gets right (p = 1.0) or everyone gets wrong (p = 0.0) provides no information about whether a candidate is high or low ability. Items that discriminate well have high correlation with overall test performance.

Ideal scenario: High-ability candidates (top 25% of test-takers) usually answer the item correctly, while low-ability candidates (bottom 25%) usually answer it incorrectly.

Point-Biserial Correlation

The standard discrimination index is point-biserial correlation (r_pb), which measures the correlation between item performance (correct/incorrect) and total test score.

Formula (simplified):

r_pb = ((M_H − M_L) / SD_total) × √(p(1 − p))

Where:
  M_H = mean total score of those who got item correct
  M_L = mean total score of those who got item incorrect
  SD_total = standard deviation of total test scores
  p = item difficulty (prop correct)
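
The formula translates directly into code. A sketch using only the standard library (population SD here; a sample-SD variant would differ slightly):

```python
from statistics import mean, pstdev

def point_biserial(item_correct, total_scores):
    """r_pb for one item. item_correct: 1/0 per candidate;
    total_scores: total test score per candidate (same order)."""
    p = sum(item_correct) / len(item_correct)
    if p in (0.0, 1.0):
        return 0.0  # item has no variance, so it cannot discriminate
    m_h = mean(s for s, c in zip(total_scores, item_correct) if c == 1)
    m_l = mean(s for s, c in zip(total_scores, item_correct) if c == 0)
    return (m_h - m_l) / pstdev(total_scores) * (p * (1 - p)) ** 0.5
```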

Interpretation:

r_pb Value | Discrimination Quality | Action
r_pb > 0.4 | Excellent | Keep item. This discriminates well.
r_pb = 0.2 to 0.4 | Good | Acceptable. Consider revising if there are other red flags.
r_pb = 0.0 to 0.2 | Fair | Weak discrimination. Review for ambiguity or technical errors.
r_pb < 0.0 | Negative (poor) | Red flag. Low scorers answer correctly while high scorers don't. Likely an answer-key error or a confusing item.

Target: r_pb > 0.3 for all items. Items with r_pb < 0.2 should be flagged for review.

Item Characteristic Curves: 1PL, 2PL, 3PL Models

What Is an ICC?

An Item Characteristic Curve (ICC) is a graph showing the relationship between ability level (x-axis) and probability of getting the item correct (y-axis). Different models fit different curves to the data.

1PL (Rasch) Model

Assumption: All items have the same discriminating power. Difficulty is the only parameter that varies.

Curve shape: All items have the same S-shape (sigmoidal), just shifted left (easy) or right (hard).

Use case: Quick, simple item analysis. Good for classroom exams with many items.

Limitation: Unrealistic—in practice, some items discriminate better than others.

2PL (Two-Parameter Logistic) Model

Assumptions: Each item has two parameters: difficulty (b) and discrimination (a). The S-curve is steeper (more discriminating) for items with high a.

Curve shape: Steep S-curves for high-discrimination items, shallow for low-discrimination items.

Use case: Most realistic for real exams. This is standard in psychometrics.

How to interpret: An item with steep curve (high a) discriminates well. Items with shallow curves discriminate poorly.

3PL (Three-Parameter Logistic) Model

Assumptions: Adds a third parameter c (pseudo-guessing), which represents the probability that even a low-ability candidate will guess correctly.

Curve shape: Doesn't approach 0 at low ability levels; instead approaches the guessing parameter (e.g., for 4-choice multiple choice, c ≈ 0.25).

Use case: Multiple-choice exams where guessing is a significant factor. Less relevant for constructed-response or practical exams.

When to use 3PL: If you suspect candidates are guessing. If not, 2PL is simpler and equally valid.
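
All three models are variants of one logistic function, which makes their relationships easy to see in code. A sketch (no scaling constant; some texts multiply a by D = 1.7 to approximate the normal ogive):

```python
import math

def icc(theta, a=1.0, b=0.0, c=0.0):
    """P(correct | ability theta) under the 3PL model.
    c=0 gives the 2PL; c=0 and a=1 gives the 1PL (Rasch)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))
```

At theta = b the 1PL/2PL curve passes through exactly 0.5, and with c = 0.25 the curve's floor for very low ability is the guessing rate, as described above.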

Differential Item Functioning (DIF) and Bias Detection

What Is DIF?

DIF occurs when an item functions differently for different demographic groups (gender, race, native language, disability status) even when overall ability is matched. In other words: high-ability candidates from Group A answer the item correctly more often than high-ability candidates from Group B.

Example: An item about fishing might disadvantage urban candidates from groups with less fishing culture, even if they're equally knowledgeable in other areas. The item isn't testing fishing knowledge; it's testing cultural familiarity.

Detecting DIF: Mantel-Haenszel Test

The Mantel-Haenszel (MH) test is a standard statistical approach to detect DIF. It compares item performance across groups, controlling for overall ability.

How it works (conceptually):

  1. Split candidates into ability levels (e.g., total score quartiles)
  2. Within each ability level, compare performance on item X between groups
  3. If Group A consistently outperforms Group B on item X (even within same ability level), the item shows DIF against Group B

MH Odds Ratio interpretation:

  • OR = 1.0: No DIF. Both groups equally likely to get item correct at same ability level.
  • OR = 1.5+: Moderate DIF (in favor of reference group). Item functions better for reference group.
  • OR = 2.0+: Large DIF. Item may be biased.
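
The common odds ratio pools a 2x2 table from each ability stratum. A sketch of the MH estimator under the usual layout, with (a, b, c, d) = (reference correct, reference incorrect, focal correct, focal incorrect) per stratum:

```python
def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across ability strata.
    strata: list of (a, b, c, d) counts, one tuple per stratum."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# One stratum where the reference group does better at matched ability:
or_hat = mh_odds_ratio([(30, 10, 25, 15)])  # -> 1.8, moderate DIF
```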

DIF Does Not Mean Bias (Important Distinction)

DIF is a statistical finding: Item functions differently for different groups.

Bias is a value judgment: The item is unfair or discriminatory.

Not all DIF is bias. Example: A fire safety question might show DIF for candidates from countries with different building codes, but that's legitimate if the exam is for a certification that requires knowledge of current building codes.

But consistent, large DIF on multiple items across multiple demographic groups is a red flag for bias and warrants investigation and potential removal or revision of items.

Distractor Analysis for Multiple Choice

What Is a Distractor and Why Analyze Them?

In multiple-choice items, the wrong answer options are called distractors. A good distractor is:

  • Plausible (attractive to low-ability candidates who haven't mastered the concept)
  • Chosen by some candidates (if no one chooses it, it's wasting an option)
  • Chosen primarily by low-ability candidates (not by high-ability candidates)

Distractor Analysis Process

Step 1: For each option (including correct answer), calculate % of test-takers who chose it.

Step 2: Among high-ability candidates (top quartile by total score), what % chose each option?

Step 3: Among low-ability candidates (bottom quartile), what % chose each option?

Example item (#12) analysis:

Question: What is 12 × 13?

A) 156 (correct)
B) 155 (off-by-one error)
C) 166 (multiplication error)
D) 100 (completely wrong)

Option | Overall % | High Ability % | Low Ability % | Quality
A (correct) | 82% | 98% | 65% | Excellent
B (off-by-one) | 12% | 2% | 22% | Good distractor
C (mult error) | 4% | 0% | 8% | Okay
D (wrong) | 2% | 0% | 5% | Poor (too obvious)

Interpretation: Option B is an excellent distractor—it appeals to low-ability candidates (22% choose it) but rarely to high-ability candidates (2%). Option D is poor—almost no one chooses it, so it's a wasted option slot. Consider replacing D.

Red flags in distractor analysis:

  • Distractor chosen by MORE high-ability candidates than low-ability ones (suggests that option might be correct, or correct answer is ambiguous)
  • Distractor never chosen (<0.5% overall) - replace this option
  • Two options equally attractive to both groups - item is ambiguous
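
The three steps above can be sketched in plain Python. Quartiles here are taken by rank of total score; ties are broken arbitrarily, and a real analysis would handle them more carefully:

```python
def option_rates(choices, total_scores, options=("A", "B", "C", "D")):
    """For each option: (overall %, top-quartile %, bottom-quartile %)."""
    n = len(choices)
    ranked = sorted(range(n), key=lambda i: total_scores[i])
    q = max(1, n // 4)
    bottom, top = set(ranked[:q]), set(ranked[-q:])
    rates = {}
    for opt in options:
        rates[opt] = (
            100 * sum(c == opt for c in choices) / n,
            100 * sum(choices[i] == opt for i in top) / q,
            100 * sum(choices[i] == opt for i in bottom) / q,
        )
    return rates
```

A distractor whose bottom-quartile rate exceeds its top-quartile rate is behaving as intended; the reverse pattern is the first red flag listed above.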

Building and Maintaining an Item Bank

What Is an Item Bank?

An item bank is a curated database of validated items that have been proven to function well through item analysis. Instead of using items once and discarding them, you maintain a bank of 200-500 tested items and draw from it for exams.

Benefits:

  • Quality assurance: You know exactly how hard each item is, how well it discriminates, whether it has bias. You use proven items, not untested ones.
  • Ability to rotate items: Use different items in different testing windows, reducing item exposure and improving test security.
  • Efficiency: Build a bank once, use it many times. Each new item costs money (writing, reviewing, piloting); amortize this cost across many exams.
  • Equating: If you use items with known difficulty and discrimination, you can equate exam scores across testing windows (an exam on date A is equivalent to an exam on date B).

Item Bank Structure

ITEM BANK DATABASE
Item ID | Domain | Difficulty | Discrimination | Format | Status | Notes

IT-001 | Python Basics | p=0.75 | r=0.42 | MCQ | Active | Excellent item
IT-002 | Python Basics | p=0.35 | r=0.38 | MCQ | Active | Good discrimination
IT-003 | Python Basics | p=0.92 | r=0.12 | MCQ | Revision | Too easy, weak discrimination
IT-004 | OOP Concepts | p=0.45 | r=0.51 | Short | Active | High discrimination
IT-005 | OOP Concepts | p=0.20 | r=-0.05 | MCQ | Retired | Negative discrimination—likely flawed

Item statuses:

  • Active: Proven to work. Can be used in exams immediately.
  • Calibrating: Newly written, currently being tested on sample of candidates. Awaiting analysis data.
  • Revision: Needs work (too easy, weak discrimination, DIF detected). Sent back for rewriting.
  • Retired: Was active but showing problems after extensive use (p has drifted, or new evidence of bias). Removed from active circulation.
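
The bank table above maps naturally onto a small record type. A sketch (field names are illustrative; a production bank would live in a real database):

```python
from dataclasses import dataclass

@dataclass
class BankItem:
    item_id: str
    domain: str
    difficulty: float       # latest calibrated p-value
    discrimination: float   # latest point-biserial
    fmt: str                # "MCQ", "Short", ...
    status: str             # "Active", "Calibrating", "Revision", "Retired"
    notes: str = ""

def usable(bank, domain=None):
    """Items that can go on an exam right now."""
    return [it for it in bank
            if it.status == "Active" and (domain is None or it.domain == domain)]
```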

Item Revision Workflow

How to Identify Underperforming Items

After each exam administration:

  1. Calculate item statistics (difficulty, discrimination, DIF) for all items
  2. Flag items with:
    • p < 0.2 or p > 0.9 (too easy or too hard)
    • r_pb < 0.2 (weak discrimination)
    • r_pb < 0 (negative discrimination—major red flag)
    • Significant DIF (MH odds ratio > 1.5)
  3. Prioritize for revision: negative discrimination > large DIF > weak discrimination > extreme difficulty
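
The flagging rules above are mechanical enough to automate. A sketch using the thresholds from the list, with the priority ordering encoded by the order in which flags are appended:

```python
def review_flags(p, r_pb, mh_or=1.0):
    """Return review flags for one item, most urgent first."""
    flags = []
    if r_pb < 0:
        flags.append("negative discrimination")
    if mh_or > 1.5:
        flags.append("significant DIF")
    if 0 <= r_pb < 0.2:
        flags.append("weak discrimination")
    if p < 0.2 or p > 0.9:
        flags.append("extreme difficulty")
    return flags
```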

Diagnosis: Why Is the Item Underperforming?

If p > 0.9 (too easy): Either the item is trivial, or the content is so important everyone knows it. Revision: Increase difficulty by rewording, adding a twist, or making options more plausible.

If p < 0.2 (too hard): Either the item is testing something advanced, or the wording is confusing. Review: Does the difficulty match the intended target audience? If so, keep. If not, simplify wording.

If r_pb < 0 (negative discrimination): This is bad. High-ability candidates are getting it wrong. Likely causes: answer key is wrong, question wording is ambiguous, or correct answer is actually less correct than a distractor. Action: Review answer key first, then item wording.

If significant DIF: Investigate whether the difference is due to legitimate content knowledge or due to cultural/linguistic bias. For bias, revise or remove the item.

Revision Strategies

For items with weak discrimination (r_pb = 0.1-0.2):

  • Reword to make the correct answer more clearly correct (without giving away the answer)
  • Make distractors more plausible (distractors that are too obviously wrong let nearly everyone answer correctly, so the item stops discriminating)
  • Increase technical difficulty (e.g., "What is 12 × 13?" → "What is 12 × 13 − 5?"), but keep conceptual difficulty the same

For items with DIF:

  • Remove or replace any cultural references that aren't essential to the concept
  • Simplify language (especially for items with DIF between native speakers and ESL candidates)
  • Broaden context (instead of "fishing," use "outdoor recreational activity")
  • If DIF is legitimate (item is testing contextual knowledge that's actually relevant), document this and move on

Applied Case Study: 30-Item Exam Analysis

Scenario

Exam: Entry-level certification in data analysis (30 multiple-choice questions, 60 min time limit). Target: pass rate 70% (passing score 21/30).

Candidates: 250 candidates from diverse backgrounds.

Goal: Analyze item functioning and revise items before using exam for high-stakes decisions.

Key Results

Exam-level statistics:

  • Mean score: 21.2/30 (70.7%)
  • SD: 4.1
  • Reliability (Cronbach's alpha): 0.81 (good)
  • Item count: 30 items is adequate for 0.80+ reliability at this SD
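
The reliability figure can be reproduced from raw item scores. A sketch of Cronbach's alpha (population variances; one 0/1 score list per item, aligned across candidates):

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))."""
    k = len(item_scores)
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_var = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var / pvariance(totals))
```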

Item-by-item results (summary):

Item | p-value | r_pb | Action
Q1-Q5 | 0.82-0.91 | 0.38-0.45 | Keep (good easy items)
Q6-Q20 | 0.35-0.68 | 0.32-0.58 | Keep (all excellent)
Q21-Q25 | 0.15-0.28 | 0.25-0.42 | Keep (good hard items)
Q26 | 0.12 | 0.08 | Revise (too hard + weak discrimination)
Q27 | 0.48 | -0.12 | Urgent: negative discrimination. Review answer key.
Q28-Q30 | 0.18-0.24 | 0.35-0.41 | Keep (good hard items)

DIF analysis (comparing gender): Item Q15 shows significant DIF (MH OR = 1.8) favoring male candidates. Content: "A company's quarterly earnings have increased by 25%. Calculate the YoY growth rate." Hypothesis: item wording favors those familiar with finance terminology. Revision: Simplify to "A store's sales increased 25% this quarter compared to last quarter. If this trend continues, by what % will sales increase over a year?"

Recommendations

  • Immediately: Review answer key for Q27 (negative discrimination)
  • Before next administration: Revise Q26 (increase difficulty or improve options), revise Q15 (address potential gender bias)
  • For future editions: Q1-Q30 provides good coverage of difficulty levels and discrimination. Can recycle 27 items; replace only Q26 and Q27.
  • Reliability: Current alpha=0.81 is good. Replacing weak items should maintain or improve this.

Software Tools for Item Analysis

Tool | Cost | Ease of Use | Best For
R (ltm, mirt packages) | Free | Hard (programming required) | Advanced analysis (ICC, DIF). Research.
Python (pyirt) | Free | Hard (programming required) | Same as R. Can integrate into custom pipelines.
Iteman (modern replacements) | $300-800 | Medium (point-and-click) | Classroom testing, CTT (Classical Test Theory) analysis.
FastTest | $500-2000 | Medium | Item bank management, IRT scaling.
Winsteps (Rasch) | $300 | Medium | Rasch (1PL) analysis. Good for classroom use.
SPSS (IRT plugin) | $700+/year | Easy (GUI) | Comprehensive. Overkill if you only need basic item analysis.

Recommendation for most teams: Start with R or Python (free, but requires learning curve). Graduate to FastTest if you manage a large item bank (200+ items).

Summary and Best Practices

Item Analysis Essential Checklist

  • Calculate difficulty (p): Target range 0.3-0.7 for most items. Mix in 20-30% easy items (p > 0.7) and 20-30% hard items (p < 0.3).
  • Calculate discrimination (r_pb): Target r_pb > 0.3. Flag items with r_pb < 0.2 or r_pb < 0 for review.
  • Check for DIF: Use Mantel-Haenszel or logistic regression. Investigate items with large DIF (MH OR > 1.5).
  • Analyze distractors (for MCQ): Good distractors are chosen by low-ability but not high-ability candidates. Replace distractors with <0.5% selection.
  • Build an item bank: Maintain 200-500 validated items for future exams. Track difficulty, discrimination, and DIF for each item.
  • Establish revision workflow: After each exam, identify underperforming items and revise before next use.
  • Use appropriate tools: R/Python for research, FastTest/SPSS for production item banks.
  • Involve subject matter experts in revision: Psychometricians can identify problems, but SMEs understand content and can fix them properly.

Start Your Item Analysis Today

Good items are the foundation of credible assessment. Item analysis is not optional—it's the quality control that separates professional evaluation systems from amateur ones.

Learn More About Item Banks