What Is ICC? Going Beyond Cohen's Kappa
The Intraclass Correlation Coefficient (ICC) is a family of statistics designed to measure how consistently multiple raters assign scores to the same items. Unlike Cohen's Kappa, which is restricted to two raters and categorical judgments, ICC accommodates three or more raters and works seamlessly with continuous scales (1-5 ratings, 0-100 scores) and ordinal scales (ranks, grade levels).
Where Cohen's Kappa asks "Do two raters agree on which category each item belongs to?", ICC asks "How reliably do multiple raters assign consistent numerical scores to these items?" This is a more nuanced question that arises constantly in AI evaluation: when you have 3-5 humans rating model outputs on a quality scale from 1 to 5, ICC tells you whether those ratings are trustworthy reflections of actual quality differences or just noise.
ICC ranges from 0 to 1 in practice, where 1 indicates perfect agreement and 0 indicates no systematic relationship between raters (negative values can occur and signal agreement worse than chance). The interpretation depends on your use case: clinical evaluations often require ICC ≥ 0.80, while creative domain ratings might accept ICC ≥ 0.65.
When to Use ICC Instead of Percent Agreement
Raw percent agreement—simply the proportion of items where all raters give identical scores—is misleading when you have ordinal or continuous data. If raters give ratings of 3, 3, and 4 on a 1-5 scale, they clearly agree substantially, but they don't achieve 100% agreement. ICC captures this partial agreement by treating the ratings as a continuous variable and measuring correlation across raters.
This makes ICC essential for any evaluation system where you're aggregating multiple human judgments into a single quality score. If you're computing the mean or median of 3 raters' 1-5 scores, you need ICC to validate that those mean scores are reliable reflections of true quality differences.
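As a concrete contrast, a short check on the 5×3 ratings used in the worked example later in this piece shows how little exact agreement there can be even when raters are close:

```python
import numpy as np

# Same 5x3 ratings as the worked example below (rows = items, columns = raters)
X = np.array([[5, 5, 4], [4, 4, 4], [2, 2, 3], [3, 3, 2], [1, 2, 2]])

# Raw percent agreement: fraction of items where all raters match exactly
exact = (X == X[:, [0]]).all(axis=1).mean()
print(exact)  # 0.2 -- only item B (4,4,4) counts, despite heavy overlap
```

Only one item in five gets unanimous scores, yet the ratings are clearly far from noise; ICC on this same data is high because near-misses like (3, 3, 2) count as substantial agreement rather than total disagreement.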
The ICC Family: Six Forms and When to Use Each
The most confusing aspect of ICC is that there are six classical forms, written ICC(model, unit). The first index picks the ANOVA model (1 = one-way random, 2 = two-way random, 3 = two-way mixed); the second says whether reliability applies to a single rating (1) or to the average of k ratings (k). Choosing wrong yields a statistic that doesn't answer your actual question.
The Three-by-Two Design
ICC choices follow a 3×2 grid:
| Model | Single Rating | Average of k Ratings |
|---|---|---|
| One-Way Random | ICC(1,1) | ICC(1,k) |
| Two-Way Random | ICC(2,1) | ICC(2,k) |
| Two-Way Mixed | ICC(3,1) | ICC(3,k) |
In the Shrout and Fleiss (1979) convention, the agreement type is bundled with the model: the two-way random forms (ICC 2) measure absolute agreement, and the two-way mixed forms (ICC 3) measure consistency. The one-way model has no separate consistency form.
Understanding the Models
One-Way Random (ICC 1): Each item is rated by a different set of raters, drawn randomly from a larger population of possible raters. This is rarely appropriate for AI evaluation, where you typically use the same 3-5 raters for all items. Use this only if your raters are interchangeable and no single rater sees every item.
Two-Way Random (ICC 2): The same raters rate all items, and those raters are treated as a random sample from a larger population, so the result generalizes to other possible raters. This is the most common choice for AI evaluation: it measures absolute agreement and tells you whether a rubric will hold up when future annotators use it.
Two-Way Mixed (ICC 3): The same raters rate all items, but they are treated as the fixed set of interest; nothing generalizes beyond them. In the Shrout and Fleiss convention this form measures consistency rather than absolute agreement. Use this when you only care about whether your specific team of 3-5 annotators scores items the same way.
Understanding the Types
Absolute Agreement: Raters must give the same numerical value. A rating of 4 by rater A and 3 by rater B counts as disagreement. This is what you want for most AI evaluation tasks: if the true quality is 4, you expect raters to converge on 4, not on 3 or 5.
Consistency: Raters only need to rank items the same way. If rater A gives 4 and rater B gives 3, but their relative rankings across all items match perfectly, consistency is high. This is rarely appropriate for AI evaluation, because you care about absolute quality judgment, not just ranking consistency.
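A small numeric sketch makes the distinction concrete. Here rater B always scores one point above rater A, so the ranking is identical but the absolute values never match: the consistency ICC is perfect while the absolute-agreement ICC is pulled down by the systematic offset. The mean squares follow the standard two-way ANOVA decomposition:

```python
import numpy as np

# Rater B always scores one point above rater A: perfect ranking
# agreement, systematic absolute disagreement.
A = np.array([1, 2, 3, 4, 5], dtype=float)
B = A + 1
X = np.column_stack([A, B])
n, k = X.shape
grand = X.mean()

# Two-way ANOVA mean squares
ms_i = k * ((X.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_r = n * ((X.mean(axis=0) - grand) ** 2).sum() / (k - 1)
ms_e = (((X - grand) ** 2).sum()
        - (n - 1) * ms_i - (k - 1) * ms_r) / ((n - 1) * (k - 1))

consistency = (ms_i - ms_e) / (ms_i + (k - 1) * ms_e)
absolute = (ms_i - ms_e) / (ms_i + (k - 1) * ms_e + k / n * (ms_r - ms_e))
print(round(consistency, 3), round(absolute, 3))  # 1.0 0.833
```

The consistency form is blind to the +1 bias; the absolute-agreement form charges the rater variance against the score, which is why absolute agreement is usually the right choice when scores feed into thresholds or aggregated quality numbers.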
Decision Logic for ICC Form Selection
Do your raters rate all (or nearly all) items?
→ YES: Use two-way model (ICC 2 or ICC 3)
→ NO: Use one-way model (ICC 1)
Are you validating against a fixed team, or do you want to generalize to new raters?
→ Fixed team: Use ICC 3
→ Generalize to new raters: Use ICC 2
Do you care about absolute scores or just relative rankings?
→ Absolute scores: Use "Absolute Agreement" form
→ Just rankings: Use "Consistency" form (rarely appropriate)
Are you computing ICC for a single rating or the average of k ratings?
→ Single rating: Use ICC(model, 1)
→ Average of k ratings: Use ICC(model, k)
For most AI evaluation scenarios, use ICC(2,1): it measures absolute agreement and generalizes to future annotators. Use ICC(3,1) only when you care solely about consistency within your current fixed team.
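The branching above can be condensed into a small helper. `choose_icc_form` is a hypothetical function (not from any library) that maps the three questions to a Shrout-and-Fleiss label; the agreement-type question folds into the model choice, since in that convention ICC(2) carries absolute agreement and ICC(3) carries consistency:

```python
def choose_icc_form(same_raters_all_items: bool,
                    generalize_to_new_raters: bool,
                    average_of_k: bool) -> str:
    """Hypothetical helper mapping the decision logic to a
    Shrout & Fleiss form label."""
    if not same_raters_all_items:
        model = 1  # one-way random: different raters per item
    elif generalize_to_new_raters:
        model = 2  # two-way random: raters sampled from a population
    else:
        model = 3  # two-way mixed: this fixed team only
    return f"ICC({model},{'k' if average_of_k else '1'})"

print(choose_icc_form(True, True, False))   # ICC(2,1)
print(choose_icc_form(True, False, True))   # ICC(3,k)
```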
ICC vs. Cohen's Kappa: When to Choose
Both ICC and Cohen's Kappa measure inter-rater reliability, but they answer different questions and apply to different data types.
| Dimension | Cohen's Kappa | ICC |
|---|---|---|
| Number of Raters | Exactly 2 | 2 or more |
| Data Type | Categorical/Nominal | Continuous/Ordinal |
| Scale Requirement | No ordering needed | Assumes equal intervals or ordinality |
| Interpretation | Agreement corrected for chance | Correlation between raters |
| Example Use Case | Two raters labeling emails as spam/not-spam | Three raters scoring chatbot quality 1-5 |
Use Cohen's Kappa when you have exactly two raters assigning items to discrete, unordered categories (safe/unsafe, spam/not-spam, sentiment: positive/neutral/negative). Use ICC when you have multiple raters or continuous/ordinal ratings. If you have both (multiple raters doing categorical assessment), use Fleiss' Kappa instead.
One critical point: ICC is not corrected for chance agreement in the way Cohen's Kappa is. ICC measures correlation, assuming raters are trying to measure the same construct. If raters are guessing randomly, ICC will be near zero, but this isn't a "chance adjustment"—it's just the absence of any systematic relationship.
Calculating ICC: Step-by-Step with Formulas
ICC is built on ANOVA decomposition. The idea is simple: if raters agree, most of the variance in ratings comes from differences between items (some items truly are higher quality), not from differences between raters. If raters disagree, rater effects are large relative to item effects.
The ANOVA Components
For a two-way model (the most common case), ANOVA decomposes the total variance into three components:
- Between-Items Sum of Squares: variance due to true differences between items.
- Between-Raters Sum of Squares: variance due to systematic differences between raters (rater bias).
- Residual (Error) Sum of Squares: what is left over; the noise.
Each sum of squares is divided by its degrees of freedom to give a Mean Square (MS), which is what the ICC formulas actually use.
Worked Example: Five AI Outputs Rated by Three Raters
Imagine you have 5 model outputs, each rated on a 1-5 quality scale by 3 raters:
| Output | Rater 1 | Rater 2 | Rater 3 | Mean |
|---|---|---|---|---|
| A | 5 | 5 | 4 | 4.67 |
| B | 4 | 4 | 4 | 4.00 |
| C | 2 | 2 | 3 | 2.33 |
| D | 3 | 3 | 2 | 2.67 |
| E | 1 | 2 | 2 | 1.67 |
The grand mean across all 15 ratings is 3.07. Now compute ANOVA:
| Source | Sum of Squares | df | Mean Square |
|---|---|---|---|
| Between Items | 18.27 | 4 | 4.57 |
| Between Raters | 0.13 | 2 | 0.07 |
| Error | 2.53 | 8 | 0.32 |
The ICC(2,1) formula (two-way random, absolute agreement, single rater) is:
ICC(2,1) = (MS_items - MS_error) / (MS_items + (k-1) * MS_error + (k/n) * (MS_raters - MS_error))
Where k=3 (number of raters) and n=5 (number of items):
ICC(2,1) = (4.57 - 0.32) / (4.57 + 2 * 0.32 + (3/5) * (0.07 - 0.32))
= 4.25 / 5.06
≈ 0.84
This ICC of 0.84 indicates excellent agreement. The three raters are highly consistent in their quality judgments.
If You Want the Average ICC
Often you don't report the ICC of a single rater; you report the reliability of the mean of k raters. This uses ICC(2,k):
ICC(2,k) = (MS_items - MS_error) / (MS_items + (MS_raters - MS_error) / n)
= (4.57 - 0.32) / (4.57 + (0.07 - 0.32) / 5)
= 4.25 / 4.52
≈ 0.94
This 0.94 reflects that when you average three raters' judgments, the resulting mean score is highly reliable. This is the number you report if your evaluation system aggregates three raters into a single consensus score.
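The hand calculation can be verified in a few lines of NumPy; the sketch below recomputes the ANOVA mean squares and both ICC forms directly from the ratings table:

```python
import numpy as np

# 5 items x 3 raters (rows = outputs A-E, columns = raters)
X = np.array([[5, 5, 4],
              [4, 4, 4],
              [2, 2, 3],
              [3, 3, 2],
              [1, 2, 2]], dtype=float)
n, k = X.shape
grand = X.mean()

# Two-way ANOVA decomposition
ss_items = k * ((X.mean(axis=1) - grand) ** 2).sum()
ss_raters = n * ((X.mean(axis=0) - grand) ** 2).sum()
ss_error = ((X - grand) ** 2).sum() - ss_items - ss_raters

ms_items = ss_items / (n - 1)              # ~4.57
ms_raters = ss_raters / (k - 1)            # ~0.07
ms_error = ss_error / ((n - 1) * (k - 1))  # ~0.32

icc_21 = (ms_items - ms_error) / (
    ms_items + (k - 1) * ms_error + k / n * (ms_raters - ms_error))
icc_2k = (ms_items - ms_error) / (ms_items + (ms_raters - ms_error) / n)
print(round(icc_21, 3), round(icc_2k, 3))  # 0.842 0.941
```

The unrounded values (0.842 and 0.941) differ from the hand computation only by rounding of the intermediate mean squares.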
Interpreting ICC Values: Cicchetti Guidelines
The most widely accepted interpretation scale comes from Cicchetti (1994):
- Below 0.40: Poor
- 0.40-0.59: Fair
- 0.60-0.74: Good
- 0.75 and above: Excellent
However, these benchmarks are context-dependent. The appropriate threshold for ICC depends on how the agreement will be used:
- For ranking models: ICC ≥ 0.60 is acceptable. You don't need perfect agreement, just enough to reliably order models by performance.
- For training data quality: ICC ≥ 0.70 is standard. Labels need to be clean enough that a model trained on them learns consistent patterns.
- For regulatory submissions: ICC ≥ 0.80 is typical. Clinical or safety-critical evals demand high reliability.
- For creative domains: ICC ≥ 0.65 may be acceptable if the construct is inherently subjective (writing quality, humor detection).
Always report ICC alongside its 95% confidence interval and the specific form used (ICC(2,1) with absolute agreement for single ratings, or ICC(2,k) for aggregated scores).
Confidence Intervals for ICC
A point estimate of ICC alone is incomplete. An ICC of 0.70 from a sample of 20 items is far less trustworthy than 0.70 from 500 items. Report the 95% confidence interval to show the precision of your estimate.
Exact confidence intervals for ICC are derived from the F-distribution. A rough normal approximation is:
CI ≈ ICC ± 1.96 × SE(ICC)
Where SE(ICC) is the standard error of the estimate. For practical purposes, statistical software (Python's pingouin library) computes exact intervals automatically, but conceptually: wider confidence intervals indicate less precision, driven mostly by smaller sample sizes.
If your 95% CI for ICC spans from 0.55 to 0.85 (a wide range), your estimate is unreliable. You need more items to be confident in your ICC value. Aim for confidence intervals with a span of 0.15 or less.
Sample Size for ICC Stability
Rough guidance on sample size needed for stable ICC estimates (assuming k=3 raters):
- ICC ≥ 0.60: ~30 items
- ICC ≥ 0.70: ~40 items
- ICC ≥ 0.75: ~50 items
- ICC ≥ 0.80: ~60 items
These are ballpark figures. Power analysis should be done formally before annotation begins if you have specific ICC targets.
Improving ICC in Practice: Interventions That Work
If your ICC is below target (say, 0.65 when you need 0.75), you have several evidence-based interventions, ranked by effectiveness:
1. Calibration Sessions (Highest Impact)
Have raters score the same set of 10-15 anchor items together, discuss disagreements, and align on shared standards. This single intervention typically improves ICC by 0.10-0.20 points. The mechanism: raters often start with different mental models of the rubric. Calibration makes those models explicit and aligned.
2. Rubric Refinement
If disagreement patterns are systematic (rater A always gives higher scores than rater B), update the rubric to add clear anchors. Include concrete examples of 1-star, 3-star, and 5-star outputs. Test the refined rubric on a small sample and recompute ICC.
3. Reduce Construct Ambiguity
Low ICC often signals that raters are measuring different constructs. If evaluating "helpfulness," define whether you mean response length, accuracy, personalization, or something else. Break multi-faceted constructs into separate dimensions, each with its own ICC.
4. Select More Experienced Raters
Experienced annotators have more consistent mental models and higher ICC with each other. If possible, prioritize raters with prior annotation experience.
5. Increase Rater Count
Adding a fourth or fifth rater increases ICC(2,k) more than ICC(2,1). If you're aggregating ratings, more raters pushes the average toward true agreement. However, this is expensive and lower-impact than calibration.
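The payoff of adding raters can be estimated before hiring anyone. A short sketch using the Spearman-Brown prophecy formula, which predicts the reliability of a k-rater average from a single-rater ICC:

```python
def spearman_brown(icc_single: float, k: int) -> float:
    """Predicted reliability of the mean of k raters, given a
    single-rater ICC (Spearman-Brown prophecy formula)."""
    return k * icc_single / (1 + (k - 1) * icc_single)

# With a single-rater ICC of 0.60, averaging more raters helps:
for k in (2, 3, 5):
    print(k, round(spearman_brown(0.60, k), 3))
# 2 0.75
# 3 0.818
# 5 0.882
```

Note the diminishing returns: going from 3 to 5 raters buys far less than a calibration session that raises the single-rater ICC itself.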
ICC in AI Evaluation Workflows
ICC should be built into your evaluation system as a quality gate, not an afterthought.
Before Data Collection Begins
Set your ICC target based on how agreement will be used. Document this target in your evaluation protocol. Communicate to raters: "We need ICC ≥ 0.70 for this task; if we fall short, we'll recalibrate and re-annotate."
Pilot Annotation Phase
Have 3-5 raters score a small pilot sample (30-50 items) and compute ICC. If ICC < 0.60, stop. Refine the rubric or run another calibration session before proceeding to full annotation.
Ongoing Monitoring
Compute ICC on rolling samples (every 200 items). If ICC starts to drift downward (rater fatigue or drift), run a mini-calibration session to re-align.
Flagging Low-ICC Items
For each item, examine the spread of its ratings (ICC is defined over a set of items, so per-item disagreement is better captured by the range or standard deviation across raters). Items where raters diverge sharply are ambiguous or problematic. Flag these for discussion: do they reveal construct ambiguity, or are they genuinely edge cases that require clarification?
Identifying Problematic Raters
Compute each rater's average ICC with others. If Rater A has ICC 0.85 with B and C, but Rater D has ICC 0.50 with B and C, Rater D may need additional training or removal.
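This screening step can be sketched by computing ICC over each pair of rater columns; a divergent rater shows up in every pair that contains them. The example below reuses the worked-example data and a hand-rolled ICC(2,1) (the rater names are placeholders):

```python
import numpy as np
from itertools import combinations

def icc21(X: np.ndarray) -> float:
    """ICC(2,1): two-way random, absolute agreement, single rating."""
    n, k = X.shape
    grand = X.mean()
    ms_i = k * ((X.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_r = n * ((X.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    ms_e = ((((X - grand) ** 2).sum()
             - (n - 1) * ms_i - (k - 1) * ms_r)
            / ((n - 1) * (k - 1)))
    return (ms_i - ms_e) / (ms_i + (k - 1) * ms_e + k / n * (ms_r - ms_e))

ratings = np.array([[5, 5, 4], [4, 4, 4], [2, 2, 3],
                    [3, 3, 2], [1, 2, 2]], dtype=float)
names = ['Rater1', 'Rater2', 'Rater3']
for a, b in combinations(range(len(names)), 2):
    print(f"{names[a]} vs {names[b]}: {icc21(ratings[:, [a, b]]):.3f}")
```

On this data Rater1 and Rater2 pair most tightly, while the pairs involving Rater3 sit noticeably lower, which is the pattern you would investigate with targeted retraining.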
Python Implementation Using Pingouin
Here's a complete working example using the pingouin library:
```python
import pandas as pd
from pingouin import intraclass_corr

# Ratings: 5 items x 3 raters (same data as the worked example)
ratings = pd.DataFrame({
    'Item': [1, 2, 3, 4, 5],
    'Rater1': [5, 4, 2, 3, 1],
    'Rater2': [5, 4, 2, 3, 2],
    'Rater3': [4, 4, 3, 2, 2]
})

# Reshape to long format for pingouin
data = ratings.melt(id_vars=['Item'], var_name='Rater', value_name='Rating')

# intraclass_corr returns all six forms (ICC1 .. ICC3k) in one DataFrame
icc = intraclass_corr(data=data, targets='Item', raters='Rater',
                      ratings='Rating').set_index('Type')

# ICC(2,1): two-way random, absolute agreement, single rating
single = icc.loc['ICC2']
print(f"ICC(2,1) = {single['ICC']:.3f}, 95% CI {single['CI95%']}")

# ICC(2,k): reliability of the mean of k=3 raters' scores
average = icc.loc['ICC2k']
print(f"ICC(2,k) = {average['ICC']:.3f}, 95% CI {average['CI95%']}")
```
Output interpretation: The ICC value and confidence interval together tell you how reliable the raters are. Here ICC(2,1) ≈ 0.84, but with only 5 items the 95% confidence interval is very wide; a wide interval means the point estimate is imprecise, and you need more items before trusting it.
Reporting ICC in Evaluation Documentation
When publishing results, follow APA format for ICC reporting:
Minimal reporting: "Inter-rater reliability was assessed using intraclass correlation coefficient. ICC(2,1) = 0.862, 95% CI [0.65, 0.95]."
Comprehensive reporting:
"To ensure annotation quality, three human raters evaluated 150 model outputs on a five-point quality scale (1=Poor to 5=Excellent). Raters completed a 90-minute calibration session on 15 anchor items before beginning full annotation. Inter-rater reliability was assessed using the two-way mixed intraclass correlation coefficient with absolute agreement (ICC[2,1]) for individual ratings. Results indicated excellent agreement: ICC(2,1) = 0.862, 95% CI [0.78, 0.92]. The average of three raters' judgments (ICC[2,3]) was 0.949, indicating that aggregated scores are highly reliable for downstream analysis. Item-level ICC analysis revealed 4 items (2.7%) with ICC < 0.40, which were flagged for re-annotation."
Always include: the ICC form (2,1 vs. 3,1 vs. 2,k), whether you're reporting single or average ratings, the confidence interval, sample size, and number of raters. This allows readers to assess the reliability of your evaluation.
Key Takeaways: ICC for Multi-Rater Agreement
- ICC measures correlation across multiple raters and works with continuous/ordinal data, unlike Cohen's Kappa (2 raters only, categorical).
- Six ICC forms exist; most AI eval uses ICC(2,1) (two-way random, absolute agreement) for single ratings that generalize to new raters, or ICC(3,1) (two-way mixed, consistency) when only the current fixed team matters.
- ANOVA decomposition is the foundation: ICC is high when item variance dominates rater variance.
- Cicchetti benchmarks (poor <0.40, fair 0.40-0.59, good 0.60-0.74, excellent 0.75+) are context-dependent; clinical evals need 0.80+, but ranking tasks accept 0.60+.
- Calibration sessions are the highest-impact intervention for improving ICC, often gaining 0.10-0.20 points with a single 90-minute session.
- Always report confidence intervals alongside point estimates to show precision and sample adequacy.
- Sample size matters: ~50 items needed for stable ICC ≥ 0.75 estimates; fewer items = wider confidence intervals.
