Why Mentorship Is a Commander Obligation
Expertise compounds through teaching. When you mentor, you sharpen your own understanding of evaluation. You learn how others struggle, which shows you where the field has gaps, and mentees often bring new approaches that teach you something in return. This reciprocity is why mentorship is not charity; it's enlightened self-interest.
More importantly, the field advances through transmission. Knowledge that stays in one person's head dies with them. Knowledge shared through mentorship multiplies. Commanders have a responsibility to mentor the next generation not because we're generous, but because AI evaluation is too important to let knowledge gaps persist.
The EAC Mentorship Model: Structure and Expectations
Duration: 8-12 weeks minimum (can extend longer). Structured, regular meetings (typically 1x/week).
Commitment: 2-4 hours/week for mentor. Mentee commitment varies (6-10 hours/week during lab prep).
Cadence: Mentor and mentee agree on schedule. Typical: Tuesday 3pm PT, 1 hour, recurring. Consistency matters more than duration.
Communication: Primary channel is video call. Asynchronous feedback via email/Slack between sessions acceptable. Office hours before exams optional but valued.
Outcome: Mentee is better prepared for L3 lab or portfolio review. Success = mentee passes their assessment.
Who Commanders Mentor: Types of Mentees
Type 1: L2→L3 Transition Candidates
L2 professionals preparing for the L3 CAEE lab exam. This is the most common mentorship relationship. The mentee is studying for a 4-week remote lab and needs coaching on evaluation design, experiment execution, and report writing. Typical mentorship duration: 8-10 weeks before the lab.
Type 2: L3→L5 Portfolio Candidates
L3 professionals preparing to submit an L5 Commander portfolio. Mentee needs guidance on portfolio structure, artifact selection, depth of contribution, writing quality. Typical duration: 12-16 weeks before submission.
Type 3: Domain-Specific Mentees
Professionals wanting to specialize in a specific evaluation domain (multimodal, safety, healthcare, etc.) where the mentor is known to be expert. Duration: 4-12 weeks depending on depth desired.
Type 4: Career Transition Mentees
People from adjacent fields (ML engineering, product management, data science) wanting to transition into AI evaluation specialization. Duration: 12-24 weeks, longer and deeper.
The Matching System: Getting the Right Mentor-Mentee Pair
Skill alignment: Mentor has expertise in the mentee's area of need. If mentee is weak on multimodal evaluation and mentor specializes in NLP, the match is imperfect but not disqualifying if mentor has general evaluation strength.
Industry matching: Mentor has worked in mentee's industry or similar. A healthcare evaluation expert can mentor someone entering finance. Exact match not required, but domain fluency helps.
Timezone compatibility: Real-time mentorship requires reasonable timezone overlap. A US-based mentor and India-based mentee (a 9.5-to-13.5 hour offset, depending on the US time zone) can make it work with both adjusting times, but Europe-to-Asia pairings are harder.
Communication style fit: Some mentors are directive ("do this"); others are Socratic ("what do you think?"). Some mentees need structure; others prefer autonomy. Misalignment here causes friction, so the eval.qa matching process tries to optimize for fit.
Availability and commitment: Both mentor and mentee must have realistic expectations. "I can do 30 min every other week" won't work for a CAEE lab prep mentee. Be honest about capacity.
How to find a mentor: Use the eval.qa mentorship matching platform (questionnaire-based), or ask your network ("Do you know a Commander who specializes in X?"). Many Commanders hold "office hours" for informal mentorship, a low-key way to test fit before a formal relationship.
Mentorship Session Structure: Recommended Agendas
Session Type 1: Calibration on Gold Standards (45 min)
Goal: Train mentee's evaluative judgment using concrete examples.
- Setup (5 min): Mentor presents a gold-standard example (e.g., an AI response to a prompt, with human expert judgment).
- Mentee assesses (10 min): Mentee independently scores/evaluates the example using the rubric at hand. No discussion yet.
- Compare & discuss (20 min): Mentor reveals expert scoring, explains the reasoning. How did mentee differ? What were they missing?
- Repeat (7 min): Another gold standard; mentee tries again. Usually shows immediate improvement.
- Wrap-up (3 min): What pattern did mentee learn? What should they focus on next?
Session Type 2: Portfolio or Project Review (45 min)
Goal: Provide critical feedback on mentee's work-in-progress.
- Submission (async, before session): Mentee sends draft (evaluation design, portfolio section, experiment plan, etc.).
- Mentee walkthrough (10 min): Mentee explains their thinking, key decisions, where they're uncertain.
- Mentor feedback (25 min): Mentor gives structured feedback: (1) What's strong, (2) What needs work, (3) Specific improvements. Focus on 2-3 key issues, not everything.
- Mentee reaction (8 min): Mentee responds, asks clarifying questions, plans revision.
- Wrap-up (2 min): Clear action items for next time.
Session Type 3: Skills Gap Workshop (45 min)
Goal: Deep dive on a specific skill mentee is weak in (statistical analysis, rubric design, human annotation, contamination detection, etc.).
- Diagnosis (5 min): Mentor and mentee agree on the skill to focus on.
- Teaching (20 min): Mentor explains the concept, walks through examples, shares mental models. This is lecture-style but short and focused.
- Practice (15 min): Mentee applies the skill to a new example. Mentor coaches. "Try computing the correlation coefficient for this dataset. How would you interpret the result?"
- Consolidation (5 min): Summary. Where does mentee apply this skill in their work? What should they practice before next session?
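The correlation exercise mentioned in the practice step can be made concrete. Here is a minimal sketch, assuming a small paired dataset of human scores versus automated metric scores (the numbers are hypothetical practice data, not real evaluation results), showing why the Pearson-vs-Spearman question in a workshop is worth asking:

```python
import math

def pearson(x, y):
    """Pearson correlation: linear association between paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """Average ranks (tied values share the mean of their positions)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson on ranks (monotonic association)."""
    return pearson(ranks(x), ranks(y))

# Hypothetical practice data: human scores vs. automated metric scores.
human = [1, 2, 3, 4, 5, 6]
metric = [1.0, 1.1, 1.2, 1.3, 1.4, 9.0]  # outlier breaks linearity

print(f"Pearson:  {pearson(human, metric):.3f}")   # ~0.688
print(f"Spearman: {spearman(human, metric):.3f}")  # 1.000
```

The ordering is perfect (Spearman = 1.0) but the relationship is far from linear, so Pearson drops well below it; this is exactly the kind of divergence a mentee should learn to interpret.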
Required Documentation and Progress Tracking
Session logs: After each session, mentor writes a brief log (5 min of effort): date, attendees, topics covered, mentee progress, action items for next session. Template:
Date: [date]
Topic: [e.g., "Calibration on VQA metrics," "Fairness in underwriting evaluation"]
Key points covered: [2-3 bullet points]
Mentee progress: [what did they understand/improve?]
Next session: [what will we focus on?]
Notes: [any concerns or observations?]
Progress tracking: Every 4 weeks, mentor fills out a progress assessment:
- Overall progress (1-5 scale): Is mentee on track for their goal?
- Strengths emerging: What is mentee getting good at?
- Areas needing work: What should we focus more on?
- Mid-course corrections: Do we need to adjust the mentorship plan?
Final mentorship report (end of mentorship): 1-2 page summary: mentee's starting point, key improvements, final readiness assessment, recommendation for next steps. This document is shared with mentee and can be included in their portfolio or lab submission.
Core Mentoring Skills for Evaluators
1. Giving Calibrated Feedback (Not Just Opinions)
Bad feedback: "Your evaluation rubric is unclear."
Good feedback: "Your rubric has three issues: (1) the 'high quality' definition could apply to both 4-star and 5-star responses; (2) no guidance on how to score partial correctness; (3) doesn't address the fairness dimension you said mattered. Here's how to fix each."
Calibrated feedback is specific, actionable, and explains the reasoning behind it.
2. Modeling Good Evaluation Thinking
When you mentor, you're not just transferring information; you're showing how a good evaluator thinks. This means:
- Being explicit about tradeoffs: "I chose sensitivity over specificity here because false positives cost more in this application."
- Showing uncertainty: "I'm not sure how to handle this edge case. Let me think through the options."
- Iterating visibly: "My first rubric was too complex. I simplified it. Here's what I learned."
3. Teaching Pattern Recognition
Expert evaluators recognize patterns: "This error looks like out-of-distribution failure," "This evaluation is suffering from annotation bias," "This metric is collapsing because the dataset is too easy." Mentees need to learn to see these patterns.
Train this through examples: "Look at these 5 model failure cases. What do they have in common? [Mentee thinks] Right—they all involve numbers outside the training distribution. This is a specific failure mode we should evaluate for explicitly."
4. Asking the Right Questions
Sometimes the best mentoring is asking questions instead of giving answers:
- "Why did you choose Pearson correlation? What would Spearman have told you differently?"
- "You say the evaluation is valid. But have you checked it against real user outcomes?"
- "Your rubric gives no guidance on this ambiguous case. How would you handle it?"
This Socratic method develops independent thinking, not dependence on the mentor.
Common Mentor Mistakes and How to Avoid Them
Mistake 1: Being Too Directive
Problem: "Use BLEU score for this task. That's what I always do." This prevents mentee from learning to make their own evaluation choices.
Fix: Guide without prescribing: "BLEU score has these strengths [X] and weaknesses [Y]. For your use case, you might consider BLEU or BERTScore or human evaluation. What are the tradeoffs you're trying to optimize?"
Mistake 2: Grade Inflation
Problem: You like your mentee and want to encourage them, so you overpraise everything. "Your rubric is great!" when it has real issues. This sets them up for failure in the lab/portfolio review.
Fix: Be honest and kind simultaneously: "I see real strength in your approach here [specific praise]. There's also an issue I want you to think about [specific concern]. Here's how to improve it."
Mistake 3: Abandoning Mentees Mid-Process
Problem: You commit to 8 weeks, but then get busy and ghost for 3 weeks. Mentee loses momentum. Mentorship breaks down.
Fix: Be honest about capacity before you start. If you can only do 2x/month reliably, say so. And if you need to step back, communicate clearly and help find a replacement mentor.
Mistake 4: Solving Problems For Them
Problem: Mentee gets stuck on evaluation design, so you design it for them. Now they've learned nothing about how to do it themselves.
Fix: Let them struggle productively. Offer scaffolding: "Walk me through your thinking. Where are you stuck? What are the options you're considering?" Help them think through it, don't think for them.
The Four Mentorship Phases
Phase 1: Orientation (Weeks 1-2)
Focus: Understand mentee's goals, baseline, strengths, gaps. Build rapport and establish norms.
Typical activities:
- Intake conversation: What's your background? What do you want to improve? What are you anxious about?
- Goal setting: By the end of 8 weeks, you'll be able to [specific, measurable goal].
- Establish cadence and norms: Same time each week, async work expected, communication style.
- Diagnostic: Quick assessment of current knowledge (e.g., have mentee design a simple evaluation, see what they do).
Phase 2: Skill Building (Weeks 3-5)
Focus: Targeted instruction on weak areas. Calibration sessions, skills workshops, practice problems.
Typical activities:
- Calibration sessions on gold standards (2-3 gold-standard examples per session)
- Skills workshops (statistical analysis, rubric design, annotation management, etc.)
- Mentee brings their own project; mentor provides feedback.
- Reading assignments (papers, blog posts to build background)
Phase 3: Independent Practice (Weeks 6-7)
Focus: Mentee does more of the work; mentor increasingly coaches rather than teaches.
Typical activities:
- Mentee designs an evaluation (for real project or hypothetical). Mentor gives feedback.
- Mentee analyzes a dataset for patterns. Mentor asks probing questions.
- Mentee writes a rubric. Mentor helps refine it through dialogue.
- Sessions shift from teaching → coaching: "What are you thinking here?" rather than "You should do this."
Phase 4: Transition (Week 8+)
Focus: Prepare mentee for independence. Review readiness, address final questions, give confidence boost.
Typical activities:
- Final project review: Is mentee ready for the lab or portfolio submission?
- Mock session: If mentee is taking CAEE lab, do a practice version of that.
- Reflection: What did mentee learn? What should they keep practicing?
- Closure: Good ending with clear summary of growth.
Measuring Mentorship Effectiveness: Real Data
The main question: Does mentorship improve mentee outcomes?
Data shows mentored candidates pass their assessments at higher rates:
- L3 CAEE lab pass rate (mentored): 73%
- L3 CAEE lab pass rate (unmentored): 51%
- Improvement: +22 percentage points
This is not a small effect. It's the difference between maybe passing and likely passing.
Secondary indicators of mentorship effectiveness:
- Mentee confidence (self-reported): Did mentee feel more confident in evaluation thinking?
- Work quality improvement: Are drafts/projects getting visibly better over 8 weeks?
- Mentor learning: Did mentor report learning from mentee? (91% of mentors say yes)
- Mentee employment: Are mentees hired into evaluation roles? (Mentored candidates convert to jobs at 2.3x rate of unmentored)
Mentorship Boundaries and Academic Integrity
What mentors DO:
- Teach evaluation concepts and methods
- Give feedback on mentee's work
- Share examples and case studies
- Model good evaluation thinking
- Help mentee debug their approaches
- Coach on how to tackle a problem
What mentors DON'T do:
- Write or design the evaluation for the mentee (they do the work themselves)
- Take the CAEE lab for them (obviously)
- Write portfolio artifacts (mentee writes them; mentor reviews)
- Give them exam questions or answers
- Inflate grades or pass candidates who aren't ready
The boundary: Mentors help mentees learn to do evaluation themselves, not do it for them.
Group vs. 1:1 Mentorship
1:1 Mentorship (Most Common)
Pros: Personalized, focuses on mentee's specific gaps, strong relationship, flexible pacing.
Cons: Time-intensive (2-4 hours/week per mentee), hard to mentor multiple people simultaneously.
Group Mentorship (Calibration Workshops)
What it is: 6-8 L3 candidates + 1-2 Commanders, monthly or bi-weekly workshops focusing on specific topics or calibration on gold standards.
Pros: Efficient (one mentor reaches many), peer learning (mentees learn from each other), networking, fun/social.
Cons: Less personalized, can't address individual gaps as deeply, requires coordinating schedules.
Best combined approach: 1:1 mentorship for serious lab prep (8-week intensive), group workshops for broader community building (open to all, monthly).
EHCS Credit for Mentoring
What is EHCS? EHCS stands for Continuing Education Hours for Specialists. Commanders must earn EHCS credits annually to maintain L5 status, and mentoring counts toward these credits.
Credit calculation:
- 1:1 mentorship session (1 hour) = 1 EHCS credit
- Group workshop delivery (2 hours) = 2 EHCS credits
- Mentorship program design/management = 3 EHCS credits/year
- Session documentation and progress reports = 0.25 EHCS credits per report
Annual requirement: Commanders must earn 20 EHCS credits/year to maintain status. Active mentors (4 mentees × 1 hour/week × 52 weeks) earn ~208 credits annually, far exceeding the minimum. Most don't need other EHCS activities.
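The credit arithmetic is simple enough to script. A minimal sketch using the rates listed above (the function and argument names are illustrative, not part of the eval.qa platform):

```python
def annual_ehcs(one_on_one_hours=0, workshop_hours=0,
                runs_program=False, reports=0):
    """EHCS credits at the stated rates: 1 credit/hour for 1:1 sessions,
    1 credit/hour for workshop delivery, 3 credits/year for program
    design/management, 0.25 credits per documented progress report."""
    credits = one_on_one_hours * 1.0
    credits += workshop_hours * 1.0
    credits += 3.0 if runs_program else 0.0
    credits += reports * 0.25
    return credits

# Example from the text: 4 mentees x 1 hour/week x 52 weeks.
print(annual_ehcs(one_on_one_hours=4 * 52))  # 208.0
```

At these rates even a single weekly mentee (52 hours/year) clears the 20-credit annual minimum more than twice over.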
Verification: You document mentorship via the platform (session logs, progress reports). Credits auto-calculate.
Frequently Asked Questions
How many mentees can I take on at once?
Recommended: 3-4 mentees simultaneously. This keeps you at 6-12 hours/week, which is manageable for most people. With more than 4, you're probably not giving each mentee quality time. If you're doing 1:1 mentoring plus running group workshops, scale back the 1:1 load.
What if a mentee fails the lab or portfolio?
It happens (~27% of mentored candidates fail). It doesn't reflect on you as a mentor, though it's disappointing. Debrief: What went wrong? Was it a knowledge gap, test anxiety, poor time management? Help them understand what to improve for the next attempt.
How do I handle a difficult mentee (defensive, not taking feedback, disengaged)?
Address directly: "I've noticed you seem resistant to feedback on X. Help me understand what's going on." Often there are reasons: imposter syndrome, past negative feedback, misalignment on goals. Once you understand, you can adapt. If it doesn't improve after explicit conversation, it's okay to end the mentorship: "I don't think I'm the right mentor for you. Let me help you find someone better suited."
Can I mentor someone from my own company?
Yes, this is common. Just be careful of conflicts: if your mentee reports to you, keep mentor-mentee relationship separate from manager-report relationship. Don't let performance reviews bleed into mentorship.
Should I charge money for mentorship?
The eval.qa model is that mentorship is part of your Commander responsibility, unpaid. Some Commanders do charge for "consulting" that goes beyond mentorship scope. Be clear about the distinction and get approval from eval.qa if you're charging.
How do I stay current if I'm mentoring on topics I'm not deeply current on?
Pair with a more expert mentor in that domain, or be honest with mentee: "This isn't my area of expertise. Let me connect you with someone better positioned to mentor you on X." Good mentorship includes knowing when you're the wrong mentor.
