What Is ICC? Going Beyond Cohen's Kappa

The Intraclass Correlation Coefficient (ICC) is a family of statistics designed to measure how consistently multiple raters assign scores to the same items. Unlike Cohen's Kappa, which is restricted to two raters and categorical judgments, ICC accommodates three or more raters and works with ordinal scales (1-5 ratings, ranks, grade levels) and continuous scores (0-100).

Where Cohen's Kappa asks "Do two raters agree on which category each item belongs to?", ICC asks "How reliably do multiple raters assign consistent numerical scores to these items?" This is a more nuanced question that arises constantly in AI evaluation: when you have 3-5 humans rating model outputs on a quality scale from 1 to 5, ICC tells you whether those ratings are trustworthy reflections of actual quality differences or just noise.

ICC typically ranges from 0 to 1, where 1 indicates perfect agreement and 0 indicates no systematic relationship between raters (estimates can even fall below 0 when raters disagree more than chance alone would produce). The interpretation depends on your use case: clinical evaluations often require ICC ≥ 0.80, while creative domain ratings might accept ICC ≥ 0.65.

Example values: 0.92 (excellent agreement), 0.74 (good), 0.55 (fair), 0.35 (poor).

When to Use ICC Instead of Percent Agreement

Raw percent agreement—simply the proportion of items where all raters give identical scores—is misleading when you have ordinal or continuous data. If raters give ratings of 3, 3, and 4 on a 1-5 scale, they clearly agree substantially, but they don't achieve 100% agreement. ICC captures this partial agreement by treating the ratings as a continuous variable and measuring correlation across raters.

This makes ICC essential for any evaluation system where you're aggregating multiple human judgments into a single quality score. If you're computing the mean or median of 3 raters' 1-5 scores, you need ICC to validate that those mean scores are reliable reflections of true quality differences.
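To make the contrast concrete, here's a small sketch using the same hypothetical 5×3 rating matrix as the worked example later in this article:

```python
import numpy as np

# Hypothetical ratings: 5 items x 3 raters (the same matrix used in the
# worked example later in this article)
X = np.array([[5, 5, 4],
              [4, 4, 4],
              [2, 2, 3],
              [3, 3, 2],
              [1, 2, 2]])

# Exact-match percent agreement: all raters give the identical score
exact = np.mean([len(set(row)) == 1 for row in X])
print(f"Exact-match agreement: {exact:.0%}")        # 20%

# Yet raters are never more than one point apart
spread = np.mean(X.max(axis=1) - X.min(axis=1))
print(f"Mean rating range per item: {spread:.1f}")  # 0.8
```

Only one item out of five reaches exact-match agreement, even though the raters are never more than one point apart: exactly the partial agreement that ICC is designed to credit.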

The ICC Family: Six Forms and When to Use Each

The most confusing aspect of ICC is that there are six different forms, denoted ICC(model, unit). Three models (one-way random, two-way random, two-way mixed) crossed with two units of analysis (a single rating vs. the average of k ratings) give you six options. Choosing wrong yields a statistic that doesn't answer your actual question.

The Three-by-Two Design

ICC choices follow a 3×2 grid (Shrout-Fleiss convention):

Model                                  Single Rating   Average of k Ratings
One-Way Random                         ICC(1,1)        ICC(1,k)
Two-Way Random (absolute agreement)    ICC(2,1)        ICC(2,k)
Two-Way Mixed (consistency)            ICC(3,1)        ICC(3,k)

Understanding the Models

One-Way Random (ICC 1): Each item is rated by a different set of raters, chosen randomly from a larger population of possible raters. This is rarely appropriate for AI evaluation, where you typically use the same 3-5 raters for all items. Use this only if your raters are completely interchangeable and you're generalizing to future raters.

Two-Way Random (ICC 2): The same raters rate all items, and you treat those raters as a random sample from a larger population, so the result generalizes to other possible raters. In the Shrout-Fleiss convention this is also the absolute-agreement form, which makes it the most common choice for AI evaluation. Use this when you're validating a rubric that future annotators will use, or whenever it matters that raters converge on the same scale values.

Two-Way Mixed (ICC 3): The same raters rate all items, but you treat them as a fixed set: the result describes agreement for your specific team of 3-5 annotators and does not generalize beyond them. This is the consistency form. Use this when all you need to know is whether your current team's collective judgment is internally consistent.

Understanding the Types

Absolute Agreement: Raters must give the same numerical value. A rating of 4 by rater A and 3 by rater B counts as disagreement. This is what you want for most AI evaluation tasks: if the true quality is 4, you expect raters to converge on 4, not on 3 or 5.

Consistency: Raters only need to rank items the same way. If rater A gives 4 and rater B gives 3, but their relative rankings across all items match perfectly, consistency is high. This is rarely appropriate for AI evaluation, because you care about absolute quality judgment, not just ranking consistency.
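A tiny numeric illustration of the difference, assuming a hypothetical rater B who is systematically one point harsher than rater A:

```python
import numpy as np

# Hypothetical: rater B is systematically one point harsher than rater A
a = np.array([5, 4, 3, 2, 1])
b = a - 1   # [4, 3, 2, 1, 0]

# Perfect consistency: the relative rankings match exactly
print(np.corrcoef(a, b)[0, 1])   # ~1.0

# Zero absolute agreement: the two raters never give the same score
print(np.mean(a == b))           # 0.0
```

A consistency ICC would score this pair near 1.0; an absolute-agreement ICC penalizes the constant one-point offset.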

Decision Logic for ICC Form Selection

Do your raters rate all (or nearly all) items?

→ YES: Use two-way model (ICC 2 or ICC 3)
→ NO: Use one-way model (ICC 1)

Are you validating against a fixed team, or do you want to generalize to new raters?

→ Fixed team: Use ICC 3
→ Generalize to new raters: Use ICC 2

Do you care about absolute scores or just relative rankings?

→ Absolute scores: Use the "Absolute Agreement" form, ICC 2
→ Just rankings: Use the "Consistency" form, ICC 3 (rarely appropriate)

Are you computing ICC for a single rating or the average of k ratings?

→ Single rating: Use ICC(model, 1)
→ Average of k ratings: Use ICC(model, k)

For most AI evaluation scenarios, use ICC(2,1): absolute agreement on single ratings, generalizing to future annotators. Reserve ICC(3,1) for the less common case where your rater team is fixed and relative consistency is all you need.
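The decision logic above can be encoded as a small helper; `choose_icc_form` is a hypothetical illustration of the mapping, not part of any statistics library:

```python
def choose_icc_form(same_raters_all_items: bool,
                    generalize_to_new_raters: bool,
                    average_of_k: bool) -> str:
    """Map the decision questions above to a Shrout-Fleiss ICC form.
    Illustrative helper only -- not a library API."""
    unit = "k" if average_of_k else "1"
    if not same_raters_all_items:
        return f"ICC(1,{unit})"   # one-way random
    if generalize_to_new_raters:
        return f"ICC(2,{unit})"   # two-way random, absolute agreement
    return f"ICC(3,{unit})"       # two-way mixed, consistency

print(choose_icc_form(True, True, False))   # ICC(2,1)
print(choose_icc_form(True, False, True))   # ICC(3,k)
```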

ICC vs. Cohen's Kappa: When to Choose

Both ICC and Cohen's Kappa measure inter-rater reliability, but they answer different questions and apply to different data types.

Dimension           Cohen's Kappa                                  ICC
Number of Raters    Exactly 2                                      2 or more
Data Type           Categorical/Nominal                            Continuous/Ordinal
Scale Requirement   No ordering needed                             Assumes equal intervals or ordinality
Interpretation      Agreement corrected for chance                 Correlation between raters
Example Use Case    Two raters labeling emails as spam/not-spam    Three raters scoring chatbot quality 1-5

Use Cohen's Kappa when you have exactly two raters assigning items to discrete, unordered categories (safe/unsafe, spam/not-spam, sentiment: positive/neutral/negative). Use ICC when you have multiple raters or continuous/ordinal ratings. If you have both (multiple raters doing categorical assessment), use Fleiss' Kappa instead.

One critical point: ICC is not corrected for chance agreement in the way Cohen's Kappa is. ICC measures correlation, assuming raters are trying to measure the same construct. If raters are guessing randomly, ICC will be near zero, but this isn't a "chance adjustment"—it's just the absence of any systematic relationship.

Calculating ICC: Step-by-Step with Formulas

ICC is built on ANOVA decomposition. The idea is simple: if raters agree, most of the variance in ratings comes from differences between items (some items truly are higher quality), not from differences between raters. If raters disagree, rater effects are large relative to item effects.

The ANOVA Components

For a two-way model, ANOVA decomposes the total variance in ratings into three components: between-item variance (true quality differences), between-rater variance (systematic leniency or severity), and residual error (the item-by-rater interaction plus random noise).

Worked Example: Five AI Outputs Rated by Three Raters

Imagine you have 5 model outputs, each rated on a 1-5 quality scale by 3 raters:

Output   Rater 1   Rater 2   Rater 3   Mean
A        5         5         4         4.67
B        4         4         4         4.00
C        2         2         3         2.33
D        3         3         2         2.67
E        1         2         2         1.67

The grand mean across all 15 ratings is 3.07. Now compute ANOVA:

Source           Sum of Squares   df   Mean Square
Between Items    18.267           4    4.567
Between Raters   0.133            2    0.067
Error            2.533            8    0.317

The ICC(2,1) formula (two-way random, absolute agreement, single rater) is:

ICC(2,1) = (MS_items - MS_error) / (MS_items + (k-1) * MS_error + k * (MS_raters - MS_error) / n)

Where k = 3 (raters) and n = 5 (items):

ICC(2,1) = (4.567 - 0.317) / (4.567 + 2 * 0.317 + 3 * (0.067 - 0.317) / 5)
         = 4.25 / 5.05
         = 0.842

This ICC of 0.842 indicates excellent agreement: the three raters are highly consistent in their quality judgments. (Dropping the rater-variance term from the denominator gives the consistency form, ICC(3,1) = (MS_items - MS_error) / (MS_items + (k-1) * MS_error) ≈ 0.817, which ignores systematic rater differences.)

If You Want the Average ICC

Often you don't report the ICC of a single rater; you report the reliability of the mean of k raters. For absolute agreement this uses ICC(2,k):

ICC(2,k) = (MS_items - MS_error) / (MS_items + (MS_raters - MS_error) / n)
         = (4.567 - 0.317) / (4.567 + (0.067 - 0.317) / 5)
         = 4.25 / 4.517
         = 0.941

This 0.941 reflects that when you average three raters' judgments, the resulting mean score is highly reliable. This is the number you report if your evaluation system aggregates three raters into a single consensus score. (The simpler form (MS_items - MS_error) / MS_items is the consistency version, ICC(3,k) ≈ 0.931.)
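Both worked results can be reproduced in a few lines of numpy. This is a minimal sketch of the two-way decomposition for one rating per cell, not a general-purpose ANOVA routine:

```python
import numpy as np

# 5 items x 3 raters, same data as the worked example
X = np.array([[5, 5, 4],
              [4, 4, 4],
              [2, 2, 3],
              [3, 3, 2],
              [1, 2, 2]], dtype=float)
n, k = X.shape
grand = X.mean()

# Sums of squares for the two-way layout (one rating per cell)
ss_items = k * ((X.mean(axis=1) - grand) ** 2).sum()
ss_raters = n * ((X.mean(axis=0) - grand) ** 2).sum()
ss_error = ((X - grand) ** 2).sum() - ss_items - ss_raters

ms_items = ss_items / (n - 1)                # ~4.567
ms_raters = ss_raters / (k - 1)              # ~0.067
ms_error = ss_error / ((n - 1) * (k - 1))    # ~0.317

# ICC(2,1): absolute agreement, single rating
icc21 = (ms_items - ms_error) / (
    ms_items + (k - 1) * ms_error + k * (ms_raters - ms_error) / n)

# ICC(2,k): absolute agreement, average of k ratings
icc2k = (ms_items - ms_error) / (ms_items + (ms_raters - ms_error) / n)

print(f"ICC(2,1) = {icc21:.3f}")   # 0.842
print(f"ICC(2,k) = {icc2k:.3f}")   # 0.941
```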

Interpreting ICC Values: Cicchetti Guidelines

The most widely accepted interpretation scale comes from Cicchetti (1994) and is adapted for ICC:

  • < 0.40: Poor Agreement
  • 0.40–0.59: Fair Agreement
  • 0.60–0.74: Good Agreement
  • 0.75–1.00: Excellent Agreement

However, these benchmarks are context-dependent. The appropriate threshold for ICC depends on how the agreement will be used: clinical or safety-critical evaluations typically demand ICC ≥ 0.80, most AI evaluation work targets ICC ≥ 0.70, and creative-domain or coarse ranking tasks may accept ICC ≥ 0.60-0.65.

Always report ICC alongside its 95% confidence interval and the specific form used (ICC(2,1) with absolute agreement for single ratings, or ICC(2,k) for aggregated scores).

Confidence Intervals for ICC

A point estimate of ICC alone is incomplete. An ICC of 0.70 from a sample of 20 items is far less trustworthy than 0.70 from 500 items. Report the 95% confidence interval to show the precision of your estimate.

Exact confidence intervals for ICC are derived from the F-distribution by inverting the ANOVA F ratios. A rough symmetric approximation is:

CI ≈ ICC ± 1.96 * SE(ICC)

Where SE(ICC) is the standard error of the estimate. For practical purposes, statistical software (Python's pingouin library) computes exact intervals automatically, but conceptually: wider confidence intervals indicate less precision and smaller sample sizes.

If your 95% CI for ICC spans from 0.55 to 0.85 (a wide range), your estimate is unreliable. You need more items to be confident in your ICC value. Aim for confidence intervals with a span of 0.15 or less.
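pingouin reports exact F-based intervals, but as a sanity check you can also bootstrap over items. A sketch, resampling the worked example's five items with replacement (expect a very wide interval at this sample size):

```python
import numpy as np

def icc21(X):
    """ICC(2,1): two-way random, absolute agreement, single rating."""
    n, k = X.shape
    grand = X.mean()
    ms_i = k * ((X.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_r = n * ((X.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    ms_e = (((X - grand) ** 2).sum()
            - (n - 1) * ms_i - (k - 1) * ms_r) / ((n - 1) * (k - 1))
    return (ms_i - ms_e) / (ms_i + (k - 1) * ms_e + k * (ms_r - ms_e) / n)

X = np.array([[5, 5, 4], [4, 4, 4], [2, 2, 3],
              [3, 3, 2], [1, 2, 2]], dtype=float)

# Percentile bootstrap over items: resample rows with replacement
rng = np.random.default_rng(0)
boot = []
for _ in range(2000):
    sample = X[rng.integers(0, len(X), size=len(X))]
    if np.ptp(sample.mean(axis=1)) > 0:   # skip degenerate resamples
        boot.append(icc21(sample))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"ICC(2,1) = {icc21(X):.3f}, bootstrap 95% CI = [{lo:.2f}, {hi:.2f}]")
```

With only five items the bootstrap interval is very wide, which is exactly the point of reporting it.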

Sample Size for ICC Stability

Rough guidance on sample size needed for stable ICC estimates (assuming k = 3 raters): around 30 items is enough for a pilot check that can catch a seriously deficient rubric, roughly 50 items gives a usably stable estimate when the true ICC is 0.75 or higher, and 100-200+ items are needed to narrow the confidence interval to a span of 0.15 or less.

These are ballpark figures. Power analysis should be done formally before annotation begins if you have specific ICC targets.
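A quick simulation makes the item-count effect visible. Assuming a simple additive model with normal item effects and rater noise (no systematic rater bias), where the true ICC is var_item / (var_item + var_noise) = 0.75, the spread of ICC estimates shrinks as the number of items grows:

```python
import numpy as np

def icc21(X):
    """ICC(2,1): two-way random, absolute agreement, single rating."""
    n, k = X.shape
    grand = X.mean()
    ms_i = k * ((X.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_r = n * ((X.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    ms_e = (((X - grand) ** 2).sum()
            - (n - 1) * ms_i - (k - 1) * ms_r) / ((n - 1) * (k - 1))
    return (ms_i - ms_e) / (ms_i + (k - 1) * ms_e + k * (ms_r - ms_e) / n)

rng = np.random.default_rng(42)
k, var_item, var_noise = 3, 3.0, 1.0   # true ICC = 3 / (3 + 1) = 0.75

spread = {}
for n in (20, 50, 200):
    est = []
    for _ in range(500):
        items = rng.normal(0, np.sqrt(var_item), size=(n, 1))
        X = items + rng.normal(0, np.sqrt(var_noise), size=(n, k))
        est.append(icc21(X))
    spread[n] = float(np.std(est))
    print(f"n={n:3d}: mean ICC = {np.mean(est):.3f}, SD of estimates = {spread[n]:.3f}")
```

The standard deviation of the estimate drops sharply between 20 and 200 items, mirroring the confidence-interval widths you would see in practice.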

Improving ICC in Practice: Interventions That Work

If your ICC is below target (say, 0.65 when you need 0.75), you have several evidence-based interventions, ranked by effectiveness:

1. Calibration Sessions (Highest Impact)

Have raters score the same set of 10-15 anchor items together, discuss disagreements, and align on shared standards. This single intervention typically improves ICC by 0.10-0.20 points. The mechanism: raters often start with different mental models of the rubric. Calibration makes those models explicit and aligned.

2. Rubric Refinement

If disagreement patterns are systematic (rater A always gives higher scores than rater B), update the rubric to add clear anchors. Include concrete examples of 1-star, 3-star, and 5-star outputs. Test the refined rubric on a small sample and recompute ICC.

3. Reduce Construct Ambiguity

Low ICC often signals that raters are measuring different constructs. If evaluating "helpfulness," define whether you mean response length, accuracy, personalization, or something else. Break multi-faceted constructs into separate dimensions, each with its own ICC.

4. Select More Experienced Raters

Experienced annotators have more consistent mental models and higher ICC with each other. If possible, prioritize raters with prior annotation experience.

5. Increase Rater Count

Adding a fourth or fifth rater increases ICC(2,k) more than ICC(2,1). If you're aggregating ratings, more raters push the average toward the items' true scores. However, this is expensive and lower-impact than calibration.

Common Pitfall: Assuming low ICC means raters are careless. Often it means the construct is genuinely contested or the rubric is ambiguous. Don't blame raters; fix the rubric.

ICC in AI Evaluation Workflows

ICC should be built into your evaluation system as a quality gate, not an afterthought.

Before Data Collection Begins

Set your ICC target based on how agreement will be used. Document this target in your evaluation protocol. Communicate to raters: "We need ICC ≥ 0.70 for this task; if we fall short, we'll recalibrate and re-annotate."

Pilot Annotation Phase

Have 3-5 raters score a small pilot sample (30-50 items) and compute ICC. If ICC < 0.60, stop. Refine the rubric or run another calibration session before proceeding to full annotation.

Ongoing Monitoring

Compute ICC on rolling samples (every 200 items). If ICC starts to drift downward (rater fatigue or drift), run a mini-calibration session to re-align.

Flagging Low-ICC Items

For each item, examine how far the raters' scores diverge (a single item has no ICC of its own, so use the variance or range of its ratings as the item-level signal). Items where raters disagree sharply are ambiguous or problematic. Flag these for discussion: do they reveal construct ambiguity, or are they genuinely edge cases that require clarification?

Identifying Problematic Raters

Compute each rater's average ICC with others. If Rater A has ICC 0.85 with B and C, but Rater D has ICC 0.50 with B and C, Rater D may need additional training or removal.
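A cheap way to screen for a discordant rater is each rater's mean pairwise correlation with the others: plain Pearson correlation here as a proxy for pairwise ICC, on a hypothetical 8-item, 4-rater matrix where Rater D is deliberately noisy:

```python
import numpy as np

# Hypothetical ratings: 8 items x 4 raters; Rater D (last column) is noisy
X = np.array([[5, 5, 4, 2],
              [4, 4, 4, 5],
              [2, 2, 3, 4],
              [3, 3, 2, 1],
              [1, 2, 2, 4],
              [5, 4, 5, 2],
              [2, 3, 2, 5],
              [4, 4, 3, 1]], dtype=float)
names = ["A", "B", "C", "D"]

# Mean pairwise Pearson correlation per rater -- a cheap screening proxy
# for pairwise ICC when hunting for a discordant annotator
corr = np.corrcoef(X.T)
mean_corr = {}
for i, name in enumerate(names):
    others = [corr[i, j] for j in range(len(names)) if j != i]
    mean_corr[name] = float(np.mean(others))
    print(f"Rater {name}: mean correlation with others = {mean_corr[name]:+.2f}")
```

In this toy data, Rater D's mean correlation with the others is far below A, B, and C (negative, in fact), which flags D for retraining before any deeper pairwise-ICC analysis.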

Python Implementation Using Pingouin

Here's a complete working example using the pingouin library:

import pandas as pd
from pingouin import intraclass_corr

# Ratings: 5 items × 3 raters (the worked example above)
ratings = pd.DataFrame({
    'Item': [1, 2, 3, 4, 5],
    'Rater1': [5, 4, 2, 3, 1],
    'Rater2': [5, 4, 2, 3, 2],
    'Rater3': [4, 4, 3, 2, 2]
})

# Reshape to long format for pingouin
data = ratings.melt(id_vars=['Item'],
                    var_name='Rater',
                    value_name='Rating')

# pingouin returns all six ICC forms in a single DataFrame
# (rows ICC1, ICC2, ICC3, ICC1k, ICC2k, ICC3k)
icc = intraclass_corr(data=data,
                      targets='Item',
                      raters='Rater',
                      ratings='Rating')

# ICC(2,1): two-way random, absolute agreement, single rater
icc2 = icc.loc[icc['Type'] == 'ICC2'].iloc[0]
print(f"ICC(2,1): {icc2['ICC']:.3f}")
print(f"95% CI: {icc2['CI95%']}")

# ICC(2,k): absolute agreement for the average of k=3 raters
icc2k = icc.loc[icc['Type'] == 'ICC2k'].iloc[0]
print(f"ICC(2,k): {icc2k['ICC']:.3f}")
print(f"95% CI: {icc2k['CI95%']}")

Output interpretation: The ICC value and confidence interval tell you how reliable the raters are. If ICC = 0.84 with a 95% CI of [0.65, 0.95], the data are consistent with anything from good to near-perfect agreement: a fairly wide range, suggesting you need more items for a precise estimate.

Reporting ICC in Evaluation Documentation

When publishing results, follow APA format for ICC reporting:

Minimal reporting: "Inter-rater reliability was assessed using the intraclass correlation coefficient. ICC(2,1) = 0.84, 95% CI [0.65, 0.95]."

Comprehensive reporting:

"To ensure annotation quality, three human raters evaluated 150 model outputs on a five-point quality scale (1=Poor to 5=Excellent). Raters completed a 90-minute calibration session on 15 anchor items before beginning full annotation. Inter-rater reliability was assessed using the two-way random intraclass correlation coefficient with absolute agreement (ICC[2,1]) for individual ratings. Results indicated excellent agreement: ICC(2,1) = 0.862, 95% CI [0.78, 0.92]. The reliability of the average of the three raters' judgments (ICC[2,k], k = 3) was 0.949, indicating that aggregated scores are highly reliable for downstream analysis. Item-level disagreement analysis flagged 4 items (2.7%) for re-annotation."

Always include: the ICC form (2,1 vs. 3,1 vs. 2,k), whether you're reporting single or average ratings, the confidence interval, sample size, and number of raters. This allows readers to assess the reliability of your evaluation.

Pro Tip: If using ICC(2,k) for aggregated scores, also report ICC(2,1) for transparency. Reviewers understand that ICC improves with averaging, so showing both numbers demonstrates you're not inflating the statistic.

Key Takeaways: ICC for Multi-Rater Agreement

  • ICC measures correlation across multiple raters and works with continuous/ordinal data, unlike Cohen's Kappa (2 raters only, categorical).
  • Six ICC forms exist; most AI eval uses ICC(2,1) (two-way random, absolute agreement) for single ratings, with ICC(2,k) for the reliability of averaged scores; ICC(3,1) fits a fixed team where consistency alone suffices.
  • ANOVA decomposition is the foundation: ICC is high when item variance dominates rater variance.
  • Cicchetti benchmarks (poor <0.40, fair 0.40-0.59, good 0.60-0.74, excellent 0.75+) are context-dependent; clinical evals need 0.80+, but ranking tasks accept 0.60+.
  • Calibration sessions are the highest-impact intervention for improving ICC, often gaining 0.10-0.20 points with a single 90-minute session.
  • Always report confidence intervals alongside point estimates to show precision and sample adequacy.
  • Sample size matters: ~50 items needed for stable ICC ≥ 0.75 estimates; fewer items = wider confidence intervals.