Why Three Phases? The Case for Structured Grading

High-stakes evaluation requires robustness. A single human grader reviewing a complex lab can make mistakes, be influenced by candidate charisma, or simply have a bad day. But if you ask three independent graders to score every submission, you triple the cost and throughput collapses.

The three-phase model solves this by using the right tool at each stage:

  • Phase 1 (Automated Screening): Catch obvious issues (incomplete submissions, format errors, plagiarism, technical errors) with zero human effort. ~20% of submissions are filtered out or flagged here.
  • Phase 2 (Human Review): Have one trained human scorer review each submission using a detailed rubric, assign a primary score, and document their reasoning. This is your main signal.
  • Phase 3 (Adjudication): For close decisions (score near pass/fail boundary or grader uncertainty), escalate to a senior adjudicator who provides the final score. This prevents borderline candidates from suffering due to one scorer's variance.

Benefits of three-phase grading:

  • Reduces grader burnout: Phase 1 automation eliminates the tedious, repetitive task of checking format. Graders do more meaningful work in Phase 2.
  • Catches errors early: Phase 1 errors are cheap to fix. Phase 2 errors are caught by Phase 3 adjudication.
  • Enables quality escalation: Hard cases escalate to experts. Easy cases are handled by junior scorers, who are cheaper and train faster.
  • Maintains throughput: Phases can run in parallel. While Phase 2 scorers work on submissions 50-100, Phase 1 can screen submissions 101-200.

Phase 1: Automated Screening

What Automated Screening Can Do

Phase 1 is 100% automated, with no human involvement. The goal is to identify submissions that have obvious, objective problems: the kind that make a submission ungradeable or trivial to fail.

Format and Completeness Checks

Check that the submission has all required files, is in the correct format, and is readable. Examples:

  • File presence: Submission should have report.docx, code.py, and results.csv. If any missing, flag.
  • File format validation: Is the submitted PDF actually a PDF, or is it a JPEG renamed to .pdf? Read file header.
  • File size: Is the code file >0 bytes? Is the report >100 bytes (must have content)? Is any file >100MB (suspiciously large, possible corruption)?
  • Encoding: Can the file be decoded as UTF-8? If not, corruption or wrong format.
  • Metadata: Does submission have timestamp, candidate ID, environment info?

Flagging rate: 2-5% of submissions typically fail format checks. These are instant rejections.
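The format and completeness checks above can be sketched as a small Phase 1 routine. This is an illustrative sketch: the required file names, the 100MB size limit, and the set of extensions treated as text files are assumptions drawn from the examples in this section.

```python
import os

# Assumptions for this sketch, taken from the examples above.
REQUIRED_FILES = ["report.docx", "code.py", "results.csv"]
MAX_SIZE_BYTES = 100 * 1024 * 1024  # >100MB is suspiciously large
TEXT_EXTENSIONS = {".py", ".csv", ".txt", ".md"}  # files expected to be UTF-8

def check_format(submission_dir):
    """Return a list of format flags; an empty list means all checks pass."""
    flags = []
    for name in REQUIRED_FILES:
        path = os.path.join(submission_dir, name)
        if not os.path.exists(path):
            flags.append("missing file: " + name)
            continue
        size = os.path.getsize(path)
        if size == 0:
            flags.append("empty file: " + name)
        elif size > MAX_SIZE_BYTES:
            flags.append("suspiciously large file: " + name)
        # Only text-like files are expected to decode as UTF-8.
        if os.path.splitext(name)[1] in TEXT_EXTENSIONS:
            try:
                with open(path, "rb") as f:
                    f.read().decode("utf-8")
            except UnicodeDecodeError:
                flags.append("not UTF-8 decodable: " + name)
    return flags
```

A real pipeline would also verify file headers (e.g., that a .pdf actually begins with the PDF magic bytes) rather than trusting extensions.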

Automated Rubric Checks

For rubric items that are objectively verifiable, automate the check. Examples:

  • Code quality: Run a linter (pylint, eslint) on submitted code. Report number of violations. If >100 critical issues, flag for manual review.
  • Testing coverage: Run code coverage tool. If <50% coverage, flag or deduct points automatically.
  • Plagiarism detection: Run MOSS or Copyleaks. If similarity >80% to known sources, flag.
  • Performance benchmarks: If lab has performance requirements (< X seconds execution time, < Y memory), run the code and measure. If it violates requirement, log the violation.
  • Schema validation: If submission should include specific data structure (e.g., JSON with fields A, B, C), validate schema. Report any schema violations.
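The schema-validation check can be done with the standard library alone. A minimal sketch, assuming a JSON results file with illustrative required fields (`candidate_id`, `score`, `timestamp` are assumptions, not part of any specific lab):

```python
import json

# Illustrative required fields and their expected types (assumptions).
REQUIRED_FIELDS = {"candidate_id": str, "score": (int, float), "timestamp": str}

def validate_schema(raw):
    """Return a list of schema violations for a JSON results file."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return ["invalid JSON: " + e.msg]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    violations = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            violations.append("missing field: " + field)
        elif not isinstance(data[field], expected):
            violations.append("wrong type for " + field)
    return violations
```

For richer schemas, a dedicated validator (e.g., a JSON Schema library) is the more maintainable choice.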

Output of Phase 1: A structured report with automated checks for each submission. Binary pass/fail for critical items (format, plagiarism). Numeric scores or violation counts for others.

Anomaly Flagging

Use simple heuristics to flag unusual patterns:

  • Submission timing: Was this submitted minutes or seconds before the deadline? Last-second submissions can indicate a rushed attempt or possible copy-paste from another candidate.
  • Size anomalies: Is this report 50 pages (suspiciously long for a 2-hour lab)? Or 1 page (suspiciously short)?
  • Code structure: Does code have all imports at the top (structured)? Or scattered throughout (sloppy)?
  • Comment density: Is code well-commented or missing comments? Calculate comment percentage.

Note: Anomalies don't auto-fail, but they're flagged for human review in Phase 2.
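The comment-density heuristic is easy to compute. A simple sketch for Python source: it counts full-line `#` comments only, so inline comments and docstrings are deliberately excluded.

```python
def comment_density(source):
    """Fraction of non-blank lines that are full-line '#' comments."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for ln in lines if ln.startswith("#"))
    return comments / len(lines)
```

A density far below typical values (or at 0%) is worth an [INFO] flag for the Phase 2 reviewer, not an automatic deduction.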

Phase 1 Output and Handoff to Phase 2

Phase 1 produces a scorecard for each submission:

PHASE 1 AUTOMATED SCREENING REPORT
Submission ID: LAB-2026-00487
Candidate: Jane Smith
Timestamp: 2026-02-15 14:32:45

FORMAT CHECKS:
  [PASS] report.pdf present and valid
  [PASS] code.py present and valid (250 lines)
  [PASS] results.csv present (8KB)
  [PASS] All files UTF-8 readable
  
AUTOMATED RUBRIC CHECKS:
  [PASS] Code syntax valid (Python 3.9)
  [WARN] Pylint score: 7.2/10 (3 medium violations, 2 minor)
  [PASS] Test coverage: 78%
  [PASS] Performance benchmark: 0.43s (requirement <1s)
  [PASS] No plagiarism detected (similarity <5%)

ANOMALIES:
  [INFO] Submitted 1 hour before deadline
  [INFO] Report is 8 pages (typical range 5-15)
  [INFO] Code has 35% comments (healthy)

PHASE 1 DECISION: PASS TO PHASE 2
Recommendation: Standard review path (Phase 2), no priority needed.

Phase 2: Human Review and Scoring

Primary Scorer Workflow

In Phase 2, a trained human scorer (the primary scorer) reviews each submission using a detailed rubric. They are responsible for:

  • Reading/reviewing the entire submission
  • Applying rubric criteria and assigning a score for each criterion
  • Generating an overall pass/fail decision or numeric score
  • Documenting evidence for their score (which parts of submission support this score?)
  • Noting any concerns, red flags, or ambiguities

Rubric Application Framework

The rubric should be criterion-referenced (what counts as passing?), not norm-referenced (how does this compare to other candidates?). Example rubric for an engineering lab:

Criterion    | Excellent (3 pts)                           | Meets Standard (2 pts)                | Below Standard (1 pt)              | Does Not Meet (0 pts)
Correctness  | All tests pass, edge cases handled          | Most tests pass, handles common cases | Some tests pass, but gaps          | Tests fail, no understanding
Code Quality | Clean, readable, well-structured            | Mostly clean with minor issues        | Some quality issues but functional | Unreadable or poorly structured
Explanation  | Clear description of approach and tradeoffs | Describes approach, minor gaps        | Brief explanation, lacks detail    | No explanation or unintelligible

The rubric should have 4-8 criteria, each worth 0-3 points. With 8 criteria, the total score range is 0-24 points and a reasonable passing threshold is 16/24 (67%).
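Aggregating criterion scores into a pass/fail decision is mechanical once the rubric is fixed. A minimal sketch, assuming the 8-criterion, 0-3-point rubric and 16/24 threshold used as the example in this section:

```python
# Assumptions from the example rubric above.
MAX_PER_CRITERION = 3
PASS_THRESHOLD = 16

def total_score(criterion_scores):
    """Sum per-criterion points, validating each is in range."""
    for name, pts in criterion_scores.items():
        if not 0 <= pts <= MAX_PER_CRITERION:
            raise ValueError(f"{name}: {pts} outside 0-{MAX_PER_CRITERION}")
    return sum(criterion_scores.values())

def passes(criterion_scores):
    """True if the total meets the passing threshold."""
    return total_score(criterion_scores) >= PASS_THRESHOLD
```

Keeping the threshold and point caps as named constants makes recalibration between cohorts a one-line change.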

Evidence Documentation

The scorer must document why they assigned each score. Template:

CRITERION: Correctness [2/3 points]
Evidence: "Code passes 8 of 10 test cases. Missing error handling for 
invalid input types (test cases 4 and 7). Candidate's approach is sound, 
but implementation has gaps."

Evidence Source: Test output log (lines 24-31), code lines 45-48

CRITERION: Code Quality [3/3 points]
Evidence: "Code is well-structured with clear variable names, proper 
indentation, and modular functions. Pylint score 8.5/10."

CRITERION: Explanation [2/3 points]
Evidence: "Report explains the overall algorithm clearly (Section 3), but 
doesn't discuss time complexity or alternatives."

Why documentation matters: When a Phase 3 adjudicator reviews the score, they can see exactly what the scorer saw and why they decided. Vague scores ("good work!") are useless.

Scorer Uncertainty and Escalation Flags

The scorer should mark submissions that are borderline or where they're uncertain:

  • Confidence score: On a 1-5 scale, how confident are they in this score? 5 = very confident, 1 = very uncertain.
  • Escalation flag: "This looks like a pass to me, but the borderline criteria are ambiguous. Recommend Phase 3 review."

Example: A submission scores 16/24 (just barely passes). Scorer marks confidence=2 and flags for escalation. This tells Phase 3: "I think this passes, but I'm not sure. Senior review recommended."

Phase 2 Time Allocation

How long should each submission take to review? Depends on lab complexity:

  • Short labs (2-hour limit): 10-15 minutes to review. Candidate output is 2-3 files, ~20 pages total.
  • Medium labs (4-hour limit): 20-30 minutes. Output is 4-5 files, ~50 pages.
  • Long labs (8+ hour limit): 45-60 minutes. Output is 6-10 files, >100 pages.

Throughput calculation: If a reviewer works 6 hours/day and spends 20 minutes per submission (including breaks and admin), they can review 18 submissions/day. For a 200-submission lab: 200 ÷ 18 ≈ 12 days with one reviewer, or 4 days with three reviewers working in parallel.
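The arithmetic above can be captured in a small planning helper (rounding partial days up; the 6-hour effective workday is the assumption used in this section):

```python
import math

def reviewer_days(n_submissions, minutes_each, reviewers, hours_per_day=6.0):
    """Calendar days of Phase 2 review, rounding partial days up."""
    per_reviewer_per_day = int(hours_per_day * 60 // minutes_each)
    per_day = per_reviewer_per_day * reviewers
    return math.ceil(n_submissions / per_day)
```

For example, `reviewer_days(200, 20, 3)` confirms that three parallel reviewers clear a 200-submission lab in about four working days.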

Phase 3: Adjudication and Appeal

Escalation Criteria for Phase 3

Not every submission needs Phase 3. Escalate to Phase 3 if:

  • Score is borderline: Within 2 points of pass/fail threshold. E.g., if passing is 16/24, escalate if score is 14-18.
  • Scorer uncertainty: Scorer confidence < 3 on 1-5 scale.
  • Rubric ambiguity: Scorer notes that the rubric is unclear or contradictory for this submission.
  • Anomalous submission: Phase 1 flagged unusual patterns (e.g., suspected plagiarism, late submission). Adjudicator investigates.
  • Appeal by candidate: Candidate disputes Phase 2 score. Escalate to senior reviewer.
  • Quality check sample: Randomly sample 10% of all passing submissions for quality assurance. Adjudicator reviews to ensure Phase 2 wasn't too lenient.

Expected escalation rate: 20-35% of submissions. This is normal and expected.
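The escalation criteria above compose naturally into a single decision function. A sketch, assuming the 16/24 threshold and 2-point borderline margin used as examples in this section:

```python
def should_escalate(score, confidence, threshold=16, margin=2,
                    rubric_ambiguity=False, phase1_anomaly=False,
                    appealed=False, qa_sampled=False):
    """Return True if a submission should go to Phase 3 adjudication."""
    if abs(score - threshold) <= margin:
        return True  # borderline score
    if confidence < 3:
        return True  # scorer uncertainty (1-5 scale)
    # Rubric ambiguity, Phase 1 anomalies, appeals, and QA samples
    # all force escalation regardless of score.
    return rubric_ambiguity or phase1_anomaly or appealed or qa_sampled
```

Logging which criterion triggered each escalation is worth adding in practice, since it tells you whether your rubric or your scorers are the main source of uncertainty.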

Adjudicator Qualifications and Independence

The adjudicator (Phase 3 reviewer) should be:

  • Senior: More experienced than Phase 2 scorers. E.g., if Phase 2 scorers have 1-3 years experience, adjudicators have 5+ years.
  • Independent: Shouldn't have graded the submission in Phase 2. Ideally, they grade blind to Phase 2 score (see submission first, then compare to Phase 2 decision).
  • Trained on rubric: Same training as Phase 2 scorers, plus advanced modules on handling edge cases and ambiguity.
  • Accountable: Adjudicators track their own agreement rate with Phase 2 (should be >80%). If agreement <70%, they need recalibration.

Consensus vs. Averaging for Disagreement

When Phase 2 scorer and adjudicator disagree, how do you resolve it?

Option 1: Consensus. Adjudicator and Phase 2 scorer have a brief discussion (5-10 min) and reach agreement. This is slower but produces higher-quality decisions. Use it only for critical submissions (e.g., where failing has major consequences).

Option 2: Adjudicator overrule. Adjudicator's score becomes the final score. Faster. Use when pass/fail decision is clear and unambiguous.

Option 3: Average. Take the mean of Phase 2 and Phase 3 scores. Simple but can hide disagreement. Not recommended.

Recommendation: Use adjudicator overrule as default (faster). Use consensus only if Phase 2 and Phase 3 disagree by >3 points (major disagreement suggests ambiguity in rubric).
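The recommended policy (overrule by default, consensus when scores differ by more than 3 points) can be sketched as:

```python
def resolve(phase2_score, phase3_score, consensus_score=None):
    """Final score under the overrule-by-default policy."""
    if abs(phase2_score - phase3_score) > 3:
        # Major disagreement: a consensus discussion is required.
        if consensus_score is None:
            raise ValueError("major disagreement: hold a consensus discussion")
        return consensus_score
    return phase3_score  # adjudicator overrule
```

Raising an error on unresolved major disagreement, rather than silently averaging, keeps the disagreement visible until humans settle it.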

Appeal Handling

Candidates should have a right to appeal their score. Appeals should:

  • Be formalized: Candidate must submit written appeal within 5 days of decision, citing specific rubric criteria they believe were misapplied.
  • Go to adjudicator: If original submission was Phase 2 only, appeal goes to Phase 3 adjudicator. If already went through Phase 3, appeal goes to a different senior reviewer (quaternary review).
  • Result in review, not negotiation: Reviewer re-reads submission and rubric. If they agree Phase 2 made an error, score is revised. But this isn't a bargaining process.
  • Be rare: <5% of submissions should appeal. If >10% appeal, rubric is unclear and needs revision before next cohort.

Inter-Phase Handoffs and Quality Control

What Information Passes Between Phases

Phase 1 to Phase 2: The Phase 1 scorecard (format checks, automated rubric checks, anomalies) goes to Phase 2 reviewer. Reviewer reads this first to understand context. Phase 1 decision (pass/fail) is not binding—Phase 2 can override if they disagree.

Phase 2 to Phase 3: Phase 2 scorer's full report (rubric scores, evidence, confidence level) goes to Phase 3 adjudicator. Adjudicator also gets Phase 1 scorecard. Adjudicator does not know Phase 2's score until after they render their own decision (blind review).

Example handoff document:

=== HANDOFF FROM PHASE 2 TO PHASE 3 ===
Submission: LAB-2026-00487
Phase 2 Reviewer: Alex Chen
Phase 2 Date: 2026-02-15

[Full Phase 2 report with scores and evidence as shown above]

Escalation Reason: Borderline score (17/24, just above 16 pass threshold)
Reviewer Confidence: 3/5 (moderately uncertain)
Reviewer Notes: "Candidate's explanation is borderline. Code works but not elegant."

[HIDDEN FROM ADJUDICATOR UNTIL AFTER INITIAL REVIEW]
Phase 2 Decision: PASS (17/24)

Quality Checks at Phase Transitions

Phase 1 to Phase 2 QC: Spot-check Phase 1 decisions. Sample 20 submissions that Phase 1 rejected (format failures, plagiarism flags). Have Phase 2 reviewer verify these are truly rejectable. If Phase 2 disagrees with >5% of Phase 1 rejections, Phase 1 automation may be too strict.

Phase 2 to Phase 3 QC: Track agreement rate between Phase 2 and Phase 3 scorers. Calculate Cohen's Kappa (inter-rater reliability). Target Kappa >0.70 (substantial agreement). If <0.60, scorers aren't aligned on rubric—conduct rubric re-training.

Final QC (Post-Phase 3): Randomly sample 10 submissions that passed all phases. Have a senior expert review these to ensure quality. If any sampled submissions are actually below standard, it indicates Phase 2 or 3 was too lenient.
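Cohen's Kappa for the pass/fail agreement check is straightforward to compute directly. A self-contained sketch for two raters' categorical decisions:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two lists of categorical decisions (e.g., P/F)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of submissions where raters match.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    p_expected = sum(
        (rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels
    )
    if p_expected == 1.0:
        return 1.0  # both raters constant and identical
    return (p_observed - p_expected) / (1 - p_expected)
```

In production, scikit-learn's `cohen_kappa_score` does the same computation; the inline version just avoids the dependency.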

Timing, Throughput, and Parallelization

Critical Path Analysis

Which phase is the bottleneck?

  • Phase 1 (automated): Runs in parallel on all submissions at ~5 seconds each, so even 1,000 submissions finish within minutes.
  • Phase 2 (human review): Takes 15-60 min per submission depending on complexity. With N reviewers, throughput is N × (submissions per day).
  • Phase 3 (adjudication): Takes 10-30 min per escalated submission. Processes only ~20-35% of all submissions.

Bottleneck is Phase 2 in most projects. Solution: Allocate more reviewers to Phase 2, fewer to Phase 3.

Parallelization Strategy for 200-Submission Lab

Goal: Grade 200 submissions in <2 weeks, maintain quality.

Timeline:

  • Day 1-2: Submissions collected. Phase 1 automated screening on all 200 (few hours of compute, results ready by end of Day 2).
  • Day 3-7: Phase 2 human review. Assign 200 submissions to 5 reviewers (40 each). At 20 min per submission, each reviewer needs ~13 hours = 2-3 days of work. Parallel work → all done in 3 days.
  • Day 8: Phase 1/2 QC checks. Spot-check 20 submissions from each phase.
  • Day 9-12: Phase 3 adjudication. Expected 200 × 25% = 50 escalations. At 15 min each, 50 escalations = 12.5 hours. With 2 adjudicators, done in 2-3 days.
  • Day 13: Appeals window opens. Process any appeals over next 5 days (week 3).

Total throughput: 200 submissions in ~10 business days with 8 people (5 Phase 2 reviewers, 2 Phase 3 adjudicators, 1 coordinator).

Score Aggregation Across Phases

Combining Scores from Multiple Phases

If the pipeline allows automatic low-risk passes and a submission clears Phase 1 with no flags: no human review. Score = Phase 1 decision.

If submission goes through Phase 2: Phase 2 score is the primary decision. If escalated to Phase 3, final score is either Phase 3 overrule or (if consensus) agreed score.

Example aggregation:

  • Submission A: Phase 1 PASS (no issues) → Final Decision: PASS (no further review needed)
  • Submission B: Phase 1 PASS → Phase 2: PASS (18/24) → Final Decision: PASS
  • Submission C: Phase 1 PASS → Phase 2: BORDERLINE (16/24) → Phase 3: PASS (17/24) → Final Decision: PASS
  • Submission D: Phase 1 FLAG (plagiarism suspected) → Phase 2: PASS (19/24) → Phase 3 (forced escalation): FAIL (5/24 due to confirmed plagiarism) → Final Decision: FAIL
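The precedence shown in these examples (a Phase 1 hard failure is final; otherwise Phase 3 overrides Phase 2, which overrides Phase 1) can be sketched as:

```python
def final_decision(phase1, phase2=None, phase3=None):
    """Final pass/fail under the phase-precedence rules above."""
    if phase1 == "FAIL":
        return "FAIL"  # format/critical failures stop at Phase 1
    if phase3 is not None:
        return phase3  # adjudication is final
    if phase2 is not None:
        return phase2  # primary human decision
    return phase1      # low-risk automatic pass
```

Note that a Phase 1 FLAG (as in Submission D) does not short-circuit the pipeline; it forces escalation, and the Phase 3 decision wins.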

Worked Example: 200-Candidate Deployment Clearance Lab

Lab Description

Lab: "Design and implement a risk assessment framework for an AI model deployment."

Submission format:

  • Design document (10-15 pages)
  • Risk matrix Excel file
  • Sample assessment report (3-5 pages)
  • Code implementation (Python, 200-400 lines)

Rubric (0-4 points per criterion, 5 criteria, 20 points total):

  • Risk Identification: Comprehensive identification of deployment risks
  • Risk Assessment: Proper quantification and prioritization
  • Mitigation Strategy: Feasible, specific mitigations
  • Code Quality: Implementation is correct and maintainable
  • Communication: Clear documentation and explanation

Pass threshold: 14/20 (70%)

Phase 1 Results (Day 2)

200 submissions screened in ~1 hour of compute time:

  • Pass format checks: 195
  • Fail format checks: 5 (missing files, corrupted documents)
  • Plagiarism detected: 3 (>80% similarity to known sources)
  • Code syntax errors: 12 (code doesn't run)
  • Anomalies flagged: 23 (various red flags)

Phase 1 decisions: 195 PASS to Phase 2, 5 FAIL (format). Flagged 3+12+23 = 38 for priority review in Phase 2.

Phase 2 Results (Day 3-7)

5 reviewers grade 195 submissions (39 each). Results summary:

Score Range | # Submissions | Decision          | Escalate to Phase 3?
17-20       | 89            | PASS (clear)      | No
14-16       | 62            | PASS (borderline) | Yes
10-13       | 35            | FAIL (borderline) | Yes
0-9         | 9             | FAIL (clear)      | No

Phase 2 escalations to Phase 3: 62 + 35 = 97 submissions flagged for adjudication, including the Phase 1 red-flag submissions that landed in the borderline ranges.

Agreement rate check: Compare reviewer scores on a set of 10 submissions reviewed by 2 reviewers each. Cohen's Kappa = 0.78 (good agreement).

Phase 3 Results (Day 9-12)

2 adjudicators review 97 escalated submissions. Results:

  • Phase 2 PASS borderline (14-16): 62 submissions. Phase 3 confirms 58 PASS, revises 4 to FAIL. (Agreement: 93%)
  • Phase 2 FAIL borderline (10-13): 35 submissions. Phase 3 confirms 30 FAIL, revises 5 to PASS. (Agreement: 86%)
  • Escalated submissions carrying Phase 1 red flags (plagiarism or syntax errors): 12 submissions received extra scrutiny. Phase 3 confirms 11 FAIL and 1 PASS (working code on re-execution and a resolved plagiarism concern); these outcomes fall within the two borderline buckets above. (Agreement: 92%)

Final decisions:

PASS (Clear): 89 submissions
PASS (Borderline, confirmed): 58 submissions
PASS (Borderline, revised from FAIL): 5 submissions
Total PASS: 152 (76%)

FAIL (Clear): 9 submissions
FAIL (Borderline, confirmed): 30 submissions
FAIL (Borderline, revised from PASS): 4 submissions
FAIL (Phase 1 format failures): 5 submissions
Total FAIL: 48 (24%)

(The 11 red-flag failures from Phase 3 are counted within the clear and borderline buckets above.)

Appeals window (Day 13-19): 8 candidates appeal their FAIL decisions. 2 appeals are upheld (Phase 3 reconsiders and agrees with appeal). 6 appeals are denied. Final tally: 154 PASS, 46 FAIL.

Three-Phase Framework for LLM-as-Judge Pipelines

Applying Three-Phase Structure to Automated Evaluation

The three-phase framework isn't just for human graders. It also works for LLM-based evaluation pipelines:

Phase 1 (Automated Screening): Run deterministic checks on submission (format, file presence, code syntax). Same as above.

Phase 2 (LLM Scoring): Use Claude or GPT-4 to score submission against rubric. Prompt: "Grade this submission on criterion X using this rubric [rubric]. Return score 0-4 with evidence."

Challenge with Phase 2 LLM: LLM scoring has consistency issues (it may score the same submission differently across runs). Solution: run scoring 3 times and take the median. This adds cost but improves robustness.

Phase 3 (Human Adjudication): For borderline LLM decisions or high-stakes submissions, escalate to human expert. Human reads submission + LLM scores and makes final call.
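The repeat-and-take-median strategy for Phase 2 LLM scoring is a few lines. In this sketch, `score_once` stands in for a real LLM scoring call (its signature is an assumption of this example):

```python
import statistics

def robust_llm_score(score_once, submission, runs=3):
    """Score the submission `runs` times and return the median.

    `score_once` is any callable mapping a submission to a numeric
    score; here it would wrap a real LLM call with the rubric prompt.
    """
    scores = [score_once(submission) for _ in range(runs)]
    return statistics.median(scores)
```

The median is preferable to the mean here because a single wildly inconsistent run should be discarded, not averaged in.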

Hybrid Pipeline Example

Cost-optimized 200-submission lab using hybrid:

  • Phase 1: Automated checks (5 min compute, zero cost)
  • Phase 2: LLM scoring on all 195 submissions. Use Claude with 3× scoring for consistency. Cost: 195 × 3 × $0.01 per score = $5.85. Time: <5 min (parallel).
  • Phase 3: Human adjudication on 50 borderline submissions. Cost: 50 × 0.25 hr × $60/hr = $750. Time: 3-4 days with one adjudicator.

Total cost: ~$750 (almost entirely Phase 3 labor). Total time: 5-10 business days.

Comparison to all-human pipeline: All-human Phase 2 (5 reviewers × 40 submissions × 30 min = 100 hours × $50/hr = $5,000) + Phase 3 ($750) = $5,750 total.

Savings with hybrid: ~$5,000. But the quality risk is higher (LLM scoring is less robust than human review). Use hybrid for low-stakes assessments and all-human for high-stakes ones.

Summary and Implementation Checklist

Three-Phase Grading Essentials

  • Phase 1 (Automated): Format checks, plagiarism detection, code linting, anomaly flagging. ~20% filtered out or flagged. Cost: near-zero.
  • Phase 2 (Human): Primary scoring using detailed rubric. One reviewer per submission. Document evidence. Cost: ~$50-100 per submission (labor).
  • Phase 3 (Expert): Adjudication on ~25% of submissions (borderlines, appeals, quality samples). Provides final score. Cost: ~$30-50 per submission.
  • Parallelization: All phases can run in parallel. For 200 submissions: Phase 1 in hours, Phase 2 in 3-5 days (5 reviewers), Phase 3 in 2-3 days (2 adjudicators).
  • Quality gates: Check inter-rater agreement (Cohen's Kappa >0.70), spot-check Phase 1 and 2 decisions, sample final decisions for quality assurance.
  • Appeals: Formal process, limited window (5 days), low rate expected (<5%). Goes to adjudicator independent from original scorer.
  • For LLM-as-judge: Phase 1 automated, Phase 2 LLM scoring (with 3× repetition), Phase 3 human adjudication on borderlines. Hybrid reduces cost 50-70% vs. all-human.

Implement Three-Phase Grading for Your Next Lab

Start with a pilot: 50 submissions through all three phases. Measure inter-rater agreement and time per phase. Use this to calibrate your full deployment.
