Introduction: Why Cut Scores Matter
Borderline-Group Standard Setting is among the most rigorous and defensible methods for establishing pass/fail cut scores in high-stakes evaluation contexts. This technique is used across medical licensing, legal bar exams, and increasingly in AI evaluation where safety and performance thresholds are critical.
A cut score—the minimum score required to pass—is arguably the most consequential decision in any evaluation program. Get it wrong by even 5 points, and you could be certifying unqualified practitioners or blocking qualified ones. Yet most organizations set cut scores through ad hoc methods: executive opinion, arbitrary percentile targets, or historical precedent.
Borderline-Group Standard Setting changes this. It's a systematic approach where subject matter experts (SMEs) identify and analyze borderline performers—those right at the boundary of competence—and use that data to set defensible cut scores. When properly executed, this method withstands legal challenge, maintains stakeholder confidence, and most importantly, serves the public interest.
Framework & Architecture
Standard setting is not a single moment but a systematic process. The Borderline-Group method operates within a larger evaluation framework:
Component 1: Purpose Definition
Before setting any cut score, clarify your purpose. Are you certifying minimum competence? Identifying top performers? Distinguishing levels of expertise? The purpose determines everything downstream. "Minimum competence" yields a lower cut score than "superior performance."
Document your purpose explicitly. Example: "This cut score identifies practitioners who can safely perform [specific task] with minimal supervision." This clarity is essential for defending the cut score later.
Component 2: Panel Composition
Borderline-Group requires carefully selected subject matter experts. Typical panels include 8-15 SMEs with:
- Deep expertise in the domain (minimum 10+ years experience)
- Diverse backgrounds and perspectives (prevent groupthink)
- Representation of different role types (practitioners, educators, supervisors)
- Geographic and demographic diversity
- No financial stake in the outcome (prevent bias)
Panel composition directly affects cut score validity. Homogeneous panels produce defensible-sounding but potentially biased cut scores.
Component 3: Test Specification
The test being cut-scored must be clearly specified: what knowledge/skills does it assess? What's the item format? How many items? What domains are covered? This specification becomes the reference point for all standard-setting judgments.
Component 4: Borderline Performance Definition
The core of Borderline-Group methodology is identifying "borderline performers": practitioners who are minimally competent but not yet clearly proficient. Define this explicitly before the standard-setting session. Example: "A borderline performer can safely handle routine cases but needs supervision for complex or unusual situations."
Component 5: Judgment Process
Experts make item-by-item judgments about what a borderline performer should answer. This process unfolds across multiple rounds, with discussion and feedback between rounds.
Borderline-Group Methodology Step-by-Step
Phase 1: Planning (Weeks 1-4)
Establish the evaluation infrastructure and recruit the panel. During planning:
- Define evaluation purpose: Write a 1-page statement of why this cut score matters
- Recruit panel: Identify and secure commitment from 10-15 diverse SMEs
- Prepare test materials: Assemble complete test documents, answer keys, and difficulty classifications
- Create briefing materials: Write clear explanations of the Borderline-Group method, panel roles, and expectations
- Schedule sessions: Coordinate panel availability for full-day or multi-day meetings
Planning is where most programs fail. Rushed recruitment leads to homogeneous panels. Inadequate briefing creates confused judges. Invest time here.
Phase 2: Pre-Session Training (Week 5)
Send panelists comprehensive training materials including:
- Overview of Borderline-Group methodology (with case examples)
- Definition of "borderline performer" in your specific context
- Complete test with answer key and marked item difficulty
- Item analysis data (discrimination index, difficulty percentage)
- Instructions for the judgment process
Ask panelists to review this material and come prepared to the session. Pre-training dramatically reduces session time and improves quality.
Phase 3: Session Round 1 - Individual Judgments (Day 1)
Panelists review each test item individually and judge: "Would a borderline performer be likely to answer this correctly?" Responses are typically yes/no or confidence ratings (e.g., 0-100% probability).
Work through items systematically. Don't allow discussion yet—you want independent judgments. Track responses carefully.
This round typically takes 4-6 hours depending on test length. A 100-item test might take 6 hours. A 50-item test takes 3-4 hours.
Phase 4: Analysis & Discussion (Evening of Day 1)
Analyze the Round 1 data. Calculate statistics for each item:
- Panelist agreement: What percentage said "yes" a borderline performer would answer correctly?
- Consensus items: Items where panelists mostly agreed (e.g., 90%+ agreement)
- Disagreement items: Items where opinions varied widely (40-60% agreement)
Prepare summary sheets showing these statistics. These become the basis for discussion.
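The tally behind those summary sheets can be sketched in a few lines, assuming Round 1 responses are recorded as a panelist-by-item matrix of yes/no values. The data below are hypothetical; the 90%/10% and 40-60% thresholds mirror the ones described above.

```python
# Hypothetical Round 1 judgments: rows = panelists, columns = items.
# 1 = "yes, a borderline performer would likely answer this correctly".
judgments = [
    [1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
]

n_panelists = len(judgments)
n_items = len(judgments[0])

# Per-item agreement: fraction of panelists saying "yes".
agreement = [sum(row[i] for row in judgments) / n_panelists for i in range(n_items)]

# Classify items using the thresholds from the text.
consensus = [i for i, a in enumerate(agreement) if a >= 0.9 or a <= 0.1]
disagreement = [i for i, a in enumerate(agreement) if 0.4 <= a <= 0.6]
```

Items in `disagreement` become the agenda for the Round 2 discussion.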
Phase 5: Session Round 2 - Group Discussion (Day 2)
Reconvene the panel. Present item-level statistics without identifying individual judges. Discuss disagreement items: Why did panelists differ? Is there ambiguity in the item? Different interpretations of "borderline"?
Discussion is NOT voting. It's reasoning together toward better judgments. Surface panelists' reasoning, identify misconceptions, and clarify the borderline performer definition as needed.
Allow panelists to revise their judgments after discussion. Many will; this is healthy. It means discussion is improving judgment quality.
Phase 6: Final Cut Score Calculation
After Round 2, calculate the proposed cut score. The formula:
Cut Score = Sum(% of panelists saying "yes" for each item) / Number of Items × (Maximum Score / 100)
With equally weighted items on a 0-100 scale, this reduces to the mean "yes" percentage across items.
Example: If panelists said an average of 72% of items would be answered correctly by a borderline performer, and items are weighted equally on a 0-100 scale, the cut score would be 72.
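The calculation can be sketched directly from a raw judgment matrix. The data below are hypothetical, and the 0-100 scaling assumes equally weighted items as in the example above.

```python
# Hypothetical post-discussion judgment matrix: rows = panelists,
# columns = items; 1 = "yes, a borderline performer would answer correctly".
judgments = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
]
n_panelists = len(judgments)
n_items = len(judgments[0])

# Per-item proportion of panelists saying "yes", averaged across items.
item_yes = [sum(row[i] for row in judgments) / n_panelists for i in range(n_items)]
cut_score = 100 * sum(item_yes) / n_items  # 0-100 scale, equal item weights
```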
Present this to the panel. Discuss: Does this cut score make sense? Would you want a practitioner scoring at this level providing service? If not, discuss why and potentially do another round of judgment.
Phase 7: Documentation & Validation
Document every decision:
- Panel roster and expertise summary
- Borderline performer definition used
- Round 1 and Round 2 data for each item
- Panel discussion notes and rationale
- Final cut score with justification
- Panelist feedback and concerns
This documentation is crucial. If your cut score is ever challenged legally or professionally, you'll need to demonstrate the process was rigorous and well-reasoned.
Comparing Borderline-Group with Modified Angoff and Bookmark
Borderline-Group is one approach to standard setting. Understanding how it compares to alternatives helps you choose the right method.
| Aspect | Modified Angoff | Bookmark (Ordered Item Booklet) | Borderline-Group |
|---|---|---|---|
| Item-by-item judgment? | Yes | Partial | Yes |
| Judges borderline performer? | No (judges minimum competence generally) | No (judges item difficulty) | Yes (explicitly) |
| Ease of administration | Moderate | High (simplest) | Moderate |
| Time required | 6-8 hours | 3-4 hours | 8-12 hours |
| Defensibility | Good | Moderate | Excellent |
| Best for | General competence assessment | Large test pools with content experts | High-stakes, safety-critical evaluation |
Modified Angoff Method
Modified Angoff asks: "What is the probability that a minimally competent candidate would answer this item correctly?" For each item, judges estimate this probability. The cut score is the sum of these estimates divided by the number of items: the expected score of a minimally competent candidate.
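A minimal sketch of the Modified Angoff arithmetic, with hypothetical judge estimates:

```python
# Hypothetical Modified Angoff ratings: each judge estimates, per item, the
# probability that a minimally competent candidate answers correctly.
ratings = [
    [0.80, 0.60, 0.40, 0.90],  # judge 1
    [0.70, 0.50, 0.50, 0.85],  # judge 2
    [0.75, 0.55, 0.45, 0.80],  # judge 3
]
n_items = len(ratings[0])

# Mean estimate per item, averaged across items: the expected score
# of a minimally competent candidate, expressed as a percentage.
item_means = [sum(judge[i] for judge in ratings) / len(ratings) for i in range(n_items)]
cut_score = 100 * sum(item_means) / n_items
```

Note the structural similarity to Borderline-Group: the difference is what judges are asked to estimate, not the arithmetic.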
Modified Angoff is simpler than Borderline-Group but less defensible. It assumes judges can accurately estimate population performance, which research shows they often can't. Judges tend toward extreme estimates (too high or too low).
Bookmark Method
Bookmark uses a simpler judgment task: items are arranged in empirical difficulty order in an "ordered item booklet," and each judge places a "bookmark" at the point where a borderline performer would stop answering correctly. The bookmark's location determines the cut score.
This is the fastest and easiest approach, but it captures less item-level judgment rationale. It's best when you have many items and less need for detailed justification.
Why Borderline-Group Excels
Borderline-Group's strength is its explicitness about the reference point: borderline performers. This makes the process easier to defend. "We identified what borderline-competent practitioners can do, and set the cut score there" is a compelling narrative that regulators and courts respect.
Use Borderline-Group when:
- Cut scores have high stakes (licensing, safety-critical decisions)
- You may face legal challenge to the cut score
- You want maximum defensibility
- You have time and resources for a rigorous process
Use Modified Angoff or Bookmark when:
- Stakes are lower (developmental assessment, skill tracking)
- You need faster turnaround
- You have fewer expert judges available
- Previous standard-setting results make cut score ranges clear
Reliability of Standard Setting Panels
A critical question: How reliable are panel judgments? Would a different panel reach the same cut score? Research shows mixed results.
Factors Affecting Panel Reliability
Panel composition is the strongest predictor. Homogeneous panels (all from same background, same institution) show high internal agreement but often differ substantially from other panels. Diverse panels show lower internal agreement but more robust, generalizable cut scores.
Training quality matters significantly. Panels receiving detailed training on Borderline-Group methodology show better inter-rater reliability (often ICC > 0.75). Untrained panels often show ICC < 0.60.
Item quality affects reliability. Ambiguous items create panelist disagreement. Well-written items with clear correct answers produce better consensus. This is why item review is essential before standard setting.
Borderline performer definition clarity is critical. The more specific and detailed your definition, the higher the panel agreement. Generic definitions ("someone who can do the job") produce weak consensus. Specific definitions ("someone who can handle routine cases but needs help with complications affecting <5% of practice") produce strong consensus.
Measuring Panel Reliability
Use these statistics to assess your panel's reliability:
Intraclass Correlation (ICC): Measures agreement among panelists. ICC > 0.75 is excellent. ICC 0.60-0.75 is good. ICC < 0.60 suggests panelists aren't judging consistently.
Standard Error of the Cut Score: Bootstrap analysis estimates the range the cut score would fall within if you repeated the process. A cut score of 72 ± 2 (SE = 2) is stable. A cut score of 72 ± 8 suggests less stable findings.
Coefficient of Variation (CV): Ratio of standard deviation to mean. CV < 0.05 indicates tight agreement. CV > 0.15 indicates dispersed judgments.
Calculate these metrics during your analysis phase. If reliability is poor, understand why: Is your borderline definition unclear? Are panelists from very different backgrounds? Do you have a few panelists with outlier views? Address the root cause.
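Assuming judgments are stored as an items-by-panelists matrix of yes/no values, the three metrics can be sketched as follows. The ICC here is the one-way random-effects ICC(1); production analyses should use a vetted statistics package, and the choice of ICC model is itself a judgment call. Data are hypothetical.

```python
import random
import statistics

# Hypothetical Round 2 matrix: rows = items, columns = panelists;
# 1 = "a borderline performer would answer this item correctly".
ratings = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
]
n_items, k = len(ratings), len(ratings[0])

# One-way random-effects ICC(1): items are targets, panelists are raters.
row_means = [sum(r) / k for r in ratings]
grand = sum(row_means) / n_items
msb = k * sum((m - grand) ** 2 for m in row_means) / (n_items - 1)
msw = sum((x - row_means[i]) ** 2 for i, r in enumerate(ratings) for x in r) / (n_items * (k - 1))
icc1 = (msb - msw) / (msb + (k - 1) * msw)

# Each panelist's implied cut score: mean of their "yes" judgments, on 0-100.
panelist_cuts = [100 * sum(ratings[i][j] for i in range(n_items)) / n_items for j in range(k)]

# Bootstrap standard error of the cut score: resample panelists with replacement.
random.seed(0)
boot = [statistics.mean(random.choice(panelist_cuts) for _ in range(k)) for _ in range(2000)]
se = statistics.stdev(boot)

# Coefficient of variation of panelist-level cut recommendations.
cv = statistics.stdev(panelist_cuts) / statistics.mean(panelist_cuts)
```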
Improving Panel Reliability
If your initial reliability metrics are weak, try these improvements:
- Clarify borderline definition: Provide more specific examples of borderline vs. competent vs. incompetent performance
- Review difficult items: Identify items where panelists disagreed and discuss ambiguities
- Use statistical anchoring: In Round 2, show panelists the Round 1 data and discuss outliers
- Do additional rounds: If reliability is still weak after Round 2, do a Round 3 with more discussion
- Replace panelists: In extreme cases, replace panelists whose judgments are statistical outliers
Don't report poor reliability as final—it's a red flag that requires investigation and remediation.
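One way to screen for outlier panelists, assuming each panelist's implied cut score has already been computed (values hypothetical). Treat flags as prompts for a follow-up conversation about the panelist's reasoning, not automatic removal.

```python
import statistics

# Hypothetical panelist-level cut recommendations (mean % "yes" per panelist).
panelist_cuts = [70, 72, 68, 71, 74, 69, 73, 55]  # last value looks anomalous

mean = statistics.mean(panelist_cuts)
sd = statistics.stdev(panelist_cuts)

# Flag panelists more than 2 SDs from the panel mean for follow-up.
outliers = [i for i, c in enumerate(panelist_cuts) if abs(c - mean) > 2 * sd]
```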
Online vs. In-Person Standard Setting
Increasingly, organizations conduct standard setting remotely. The pandemic accelerated this shift. What works? What doesn't?
| Factor | In-Person | Online/Hybrid |
|---|---|---|
| Discussion quality | Excellent (natural flow) | Good (requires moderation) |
| Participant engagement | High (physical presence enforces focus) | Variable (easy to disengage) |
| Accessibility | Low (travel required) | High (no travel) |
| Geographic diversity | Harder (cost barrier) | Easier (no cost barrier) |
| Cost | High (travel, venue) | Low (platform only) |
| Data capture | Moderate (note-taking) | Excellent (video, transcripts) |
| Cognitive load | Moderate | High (screen fatigue) |
Best Practices for Online Standard Setting
Use high-quality platforms: Zoom/Teams are acceptable but can feel impersonal. Dedicated collaboration platforms (Miro, Mural) work better for visual discussion of item cards and data.
Break into smaller groups: Instead of one 15-person discussion, use breakout groups of 4-5 people. Have each group discuss 20-25 items thoroughly. Then reconvene full group to review disagreement items. This increases engagement and discussion quality.
Use asynchronous rounds: In Round 1, let panelists do judgments on their own schedule within a 3-day window. This accommodates time zones and lets judges work at their own pace. Pool the data and discuss synchronously in Round 2.
Shorten session duration: Don't try to replicate 8-hour in-person sessions online. Instead, do 2-3 hour sessions across multiple days. Screen fatigue is real; people's judgment quality degrades after 3-4 hours of video conferencing.
Improve data visualization: Online, visual data matters more. Create clear dashboards showing item-by-item agreement. Use color coding (green for high agreement, red for disagreement). This helps guide discussion.
Assign a skilled facilitator: Online facilitation is harder than in-person. You need someone who can manage the discussion, prevent dominance by loud voices, pull out quiet participants, and keep things on track. This is worth budgeting for.
Hybrid Approaches
Some organizations do hybrid: core panelists in person, others joining remotely. This can work but creates two-tier engagement. Remote panelists often feel less invested. If you go hybrid, ensure remote participants have equal voice and visibility. Use breakout groups and rotating facilitators to ensure remote voices are heard.
Documenting Cut Score Decisions
Documentation is not bureaucracy—it's your defense. The more thoroughly you document, the more defensible your cut score becomes.
What to Document
Panel composition: Roster with names, titles, organizations, years of experience, and diversity characteristics. Include a justification for why this panel was selected (e.g., "Panel includes [X] practitioners, [Y] educators, [Z] supervisors to represent diverse perspectives").
Borderline performer definition: Write your definition word-for-word. Include examples of borderline vs. competent vs. incompetent performance specific to your context. Document any revisions to the definition during the process (this shows you refined thinking, not arbitrary choices).
Test specification: Describe what the test measures, format, length, content domains, number of items per domain, and difficulty distribution. Include item analysis data (difficulty percentages, discrimination indices).
Pre-session training materials: Attach copies of everything sent to panelists before the session. This shows they were adequately prepared.
Round 1 data: For each item, document the percentage of panelists saying "yes" (borderline performer would answer correctly). Include summary statistics (mean, SD, range).
Round 1 analysis: Document which items had strong consensus and which had disagreement. Explain why you selected disagreement items for discussion.
Discussion notes: For each discussed item, summarize panelists' reasoning. Why did some think a borderline performer should answer it? Why did others disagree? What clarifications emerged?
Round 2 data: Document how many panelists revised their judgments and in which direction. Show post-discussion data for each item.
Cut score calculation: Show the mathematical formula. If you averaged panelist predictions: show the calculation. If you used a different method: explain why.
Final cut score with justification: State the cut score clearly. Explain what it means: "This cut score of 72 indicates that a borderline-competent practitioner should correctly answer approximately 72% of items on this test. Practitioners scoring below 72 likely lack minimum competence for independent practice. Practitioners scoring above 72 demonstrate competence with some areas of strength."
Reliability metrics: Include ICC, standard error, and other reliability data. Interpret these for readers: "The ICC of 0.82 indicates strong agreement among panelists, suggesting the cut score is stable and defensible."
Limitations and uncertainties: Every standard-setting process has limitations. Document yours honestly: "This panel was 70% male, which may not reflect the full diversity of practitioners. We recommend repeating standard setting with a more gender-balanced panel in 3 years." Transparency about limitations actually increases defensibility by showing you're not hiding problems.
Panelist feedback: Include written feedback from panelists on the process. Did they find it clear and fair? What could improve? Include a few verbatim quotes (with permission). This shows you valued their input.
Presentation Format
Create a formal standard-setting report (20-30 pages typically). Structure it:
- Executive summary (1 page): Cut score, methodology, key finding
- Purpose and context (2 pages): Why was this standard setting needed?
- Methodology (3 pages): Describe Borderline-Group process step-by-step
- Panel composition (1 page): Roster and justification
- Test specification (2 pages): What's being evaluated
- Borderline performer definition (1 page): Explicit definition with examples
- Results (4 pages): Round 1 and Round 2 data, analysis, discussion summary
- Final cut score and rationale (2 pages): The recommendation and justification
- Reliability analysis (2 pages): ICC, SEM, and interpretation
- Limitations and recommendations (1 page): What could improve future standard setting
- Appendices: Panel roster, training materials, item-level data, discussion notes, panelist feedback
This level of documentation looks impressive because it is. It also makes your cut score far harder to challenge successfully.
Implementation & Scaling
Once you've established a cut score through Borderline-Group methodology, how do you implement it? And how do you maintain it over time?
Initial Implementation
Announce the cut score with context. Don't just say "The new cut score is 72." Explain: "Based on expert judgment about borderline competence, the cut score is 72. This means practitioners scoring 72 and above demonstrate minimum competence. Those scoring below 72 likely lack essential skills for independent practice."
Prepare stakeholders for any changes. If the previous cut score was 68 and the new one is 72, some people will fail who would have passed before. Communicate this clearly and explain why (the higher bar reflects updated standards, greater clarity on borderline competence, etc.).
Implement with a transition period if stakes are high. Example: "The new cut score of 72 becomes effective January 1, 2027. Until then, the previous cut score of 68 remains in effect for grandfathering purposes." This reduces disruption while still raising standards going forward.
Maintenance Over Time
Cut scores aren't forever. As the field evolves, content changes, and new panelists bring fresh perspectives, revisit your cut score every 3-5 years. This is standard practice in educational and professional testing.
Between major standard-setting sessions, monitor your cut score's performance:
- Pass rate tracking: Is the percentage of people passing stable? Drifting up or down? Major changes suggest the cut score may need adjustment.
- Outcome tracking: For high-stakes evaluations, track what happens to people who pass vs. fail. If those who passed subsequently fail in their jobs, the cut score may be too low. If many who failed would have succeeded, it may be too high.
- Item performance analysis: As items age, do they remain discriminating? If an item is answered correctly by 95% of test-takers, it no longer contributes to the cut score meaningfully. Replace aging items.
- Stakeholder feedback: Collect feedback from practitioners, supervisors, and users: Is the cut score appropriate? Do people passing it actually perform well? Do failures make sense?
Use this data to inform the next standard-setting cycle. Document these findings and share them with panelists as part of their preparation for re-paneling.
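Pass-rate drift monitoring can be automated with a simple check like the following. The quarterly data and the 5-percentage-point threshold are illustrative; pick a threshold that reflects normal cohort-to-cohort variation in your program.

```python
# Hypothetical quarterly pass rates against the established cut score.
pass_rates = {"2025Q1": 0.81, "2025Q2": 0.80, "2025Q3": 0.74, "2025Q4": 0.71}

baseline = pass_rates["2025Q1"]

# Flag quarters drifting more than 5 percentage points from baseline;
# sustained drift suggests the cut score (or the test itself) needs review.
drifting = [q for q, r in pass_rates.items() if abs(r - baseline) > 0.05]
```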
Real-World Case Study: Medical AI Licensing Standard Setting
A healthcare organization built an AI diagnostic support system and needed to establish a cut score for "clinical-grade performance." They couldn't just guess—regulators would demand evidence of how they chose the cutoff.
Panel Composition
They recruited 12 SMEs:
- 5 practicing clinicians (MDs from different specialties)
- 3 medical educators (professors at academic medical centers)
- 2 healthcare quality/safety directors
- 2 regulatory/compliance experts
Gender split: 50-50. Years of experience: 12-34 years (median 19). Geographic: West Coast (3), Midwest (4), East Coast (5).
Borderline Performer Definition
After initial discussion, the panel defined a "borderline clinically-competent AI system" as one that:
Can correctly diagnose 80% of common and moderately common conditions (those representing >1% of practice). May miss rare conditions. When it makes errors, the errors are typically of a type that an alert clinician would catch and correct. It can be used safely under physician supervision but must not be used for independent diagnosis; high-stakes cases require specialty review.
This definition was explicit enough to guide judgment but realistic about what an AI system could do.
Process
They conducted a 2-day session (hybrid, with 8 panelists in-person and 4 remote):
Day 1 morning: Training on Borderline-Group methodology, review of test (120 diagnostic cases), introduction of analysis tool.
Day 1 afternoon: Round 1 judgments. Each panelist, independently, reviewed all 120 cases and judged: "Would a borderline clinically-competent AI system diagnose this correctly?" Entered yes/no for each case into a shared spreadsheet.
Day 1 evening: Facilitation team analyzed data. Created summary showing, for each case, what percentage of panelists said "yes." Identified 24 cases with high disagreement (40-60% agreement) for discussion.
Day 2 morning: Group discussion (full 12 panelists) of disagreement cases. Why did panelists differ on this case? What assumptions about "borderline competence" created the disagreement? After 30 minutes of discussion, panelists revised their judgments asynchronously.
Day 2 afternoon: Review final numbers, validate the cut score, discuss implications.
Results
Pre-discussion (Round 1), the average proportion of "yes" judgments across all cases was 68.2%. Post-discussion, it was 72.4%. This 4.2-point shift indicated that discussion sharpened the panel's shared understanding of borderline competence.
The panel recommended a cut score of 72%, meaning the AI system must correctly diagnose approximately 72% of the test cases to be considered clinically-grade.
Reliability metrics:
- ICC = 0.79 (good-to-excellent agreement)
- Standard error = 1.8 percentage points (cut score of 72 ± 1.8)
- Coefficient of variation = 0.08 (relatively tight distribution)
Outcome
The AI system was tested against the cut score. Initial performance: 71.2%, just below the cut score. The development team focused on the cases where the system had erred. After optimization, the system achieved 74.1% on the test set and 73.6% on an independent validation set.
Regulators accepted the standard-setting process and approved the system for clinical use with physician oversight. The documented, rigorous process was the key to regulatory acceptance.
The standard-setting process itself becomes a product. The cut score matters, but the documentation that you chose it rigorously matters even more. Invest accordingly.
Advanced Topics in Standard Setting
Differential Item Functioning (DIF) in Standard Setting
Borderline-Group methodology can mask differential item functioning—items that are easier or harder for certain groups. During analysis, check whether any items show strong DIF. Example: A clinical diagnostic case about a condition more common in one demographic might be systematically easier for panelists from that demographic. Flag these and discuss: Is the item fair? Does it reveal legitimate competence differences or demographic stereotyping?
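A coarse screen for this kind of divergence, assuming per-item "yes" rates have been computed separately for two panelist subgroups. The data and the 20-point threshold are illustrative; a formal DIF analysis of examinee responses would use a statistic such as Mantel-Haenszel rather than this simple gap check.

```python
# Hypothetical per-item "yes" rates from two panelist subgroups
# (e.g., grouped by specialty or region; labels are illustrative).
group_a = [0.90, 0.55, 0.70, 0.40]
group_b = [0.88, 0.58, 0.35, 0.42]

# Flag items where subgroup judgments diverge by more than 20 points;
# these warrant a fairness discussion before influencing the cut score.
flagged = [i for i, (a, b) in enumerate(zip(group_a, group_b)) if abs(a - b) > 0.20]
```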
Equating Cut Scores Across Test Forms
If you administer different test forms (parallel forms), you need to equate cut scores so they're comparable. A 72 on Form A should be equivalent in difficulty to a specific score on Form B. Use equating methods (e.g., Tucker equating, Levine equating) to adjust. This is technical but important if you're using multiple test forms.
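The simplest linear approach is mean-sigma equating on a common group's scores, sketched below with hypothetical data. Tucker and Levine methods refine this using anchor-test relationships; this sketch only illustrates the idea of mapping a cut score from one form's scale to another's.

```python
import statistics

# Hypothetical scores of a common group taking both forms.
form_a = [60, 68, 72, 75, 80, 85]
form_b = [58, 65, 70, 72, 78, 83]

mu_a, sd_a = statistics.mean(form_a), statistics.pstdev(form_a)
mu_b, sd_b = statistics.mean(form_b), statistics.pstdev(form_b)

def equate_a_to_b(score_a):
    """Mean-sigma linear equating: express a Form A score on the Form B scale."""
    return mu_b + sd_b * (score_a - mu_a) / sd_a

cut_on_b = equate_a_to_b(72)  # the Form A cut score mapped onto Form B
```

Because Form B is slightly harder here (lower mean), the equated cut lands a little below 72.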
Sensitivity Analysis
How sensitive is your cut score to changes in borderline performer definition or panel composition? Try alternative definitions or panels (in simulation, not actual paneling) and see what cut scores result. If the cut score moves by <5 percentage points under reasonable variations, it's robust. If it swings wildly, it's fragile.
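One simple sensitivity check is a leave-one-panelist-out jackknife on panelist-level cut recommendations (hypothetical values below): if removing any single panelist barely moves the cut score, the result is robust to panel composition at the margin.

```python
# Hypothetical panelist cut recommendations (each panelist's mean % "yes").
panelist_cuts = [70.0, 73.0, 71.0, 74.0, 69.0, 72.0, 75.0, 68.0]

full_cut = sum(panelist_cuts) / len(panelist_cuts)

# Leave-one-panelist-out: recompute the cut score with each panelist removed.
jackknife = [
    sum(c for j, c in enumerate(panelist_cuts) if j != i) / (len(panelist_cuts) - 1)
    for i in range(len(panelist_cuts))
]
spread = max(jackknife) - min(jackknife)  # small spread = robust cut score
```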
Common Mistakes in Standard Setting
Mistake 1: Anchoring on History
Panel knows the previous cut score was 68. Subconsciously (or consciously), they anchor around 68, drifting only slightly. Avoid this by not revealing the previous cut score until after the panel recommends one. Then compare and discuss why they differ.
Mistake 2: Homogeneous Panels
All panelists from the same institution, same demographic, same background. They'll reach consensus quickly but it may not generalize. Diversity is harder to manage but produces more defensible results.
Mistake 3: Insufficient Borderline Performer Definition
Defining "borderline" as "someone who can do the job okay" is too vague. Panelists will interpret this differently. Be specific: what can they do? What can't they? What happens when they make errors? How much supervision do they need?
Mistake 4: Ignoring Item Quality
If items are ambiguous or poorly written, panel judgment will be poor. Do item review before standard setting. Remove ambiguous items. Rewrite poor items. This upstream work pays dividends.
Mistake 5: No Reliability Checks
Computing ICC and finding it's 0.58, then just reporting the cut score anyway. Low reliability is a warning sign. Investigate and remedy. The extra effort is worth it.
Mistake 6: Weak Documentation
Skimping on the report saves time but costs defensibility. When your cut score is challenged (and it will be), your documentation is your only defense. Invest fully.
Key Takeaways
- Purpose matters: Define what you're measuring and why before setting any cut score
- Panel composition: Recruit diverse SMEs with clear expertise. Homogeneity produces weak defensibility
- Borderline definition: Be explicit about what borderline competence means in your context. This is your anchor
- Multiple rounds: Round 1 (independent judgment), Round 2 (group discussion and revision). One round is insufficient
- Reliability check: Compute ICC and standard error. Weak reliability needs investigation and remediation
- Documentation: The process is your product. Document thoroughly. This is your defense
- Maintenance: Revisit cut scores every 3-5 years. Standards evolve; your cut scores should too
Ready to Master Evaluation Standards?
Deep dive into standard-setting methodology and other evaluation disciplines with the CAEE Level 3 program.
Exam Coming Soon