What the Portfolio Is and Why It Exists
The L5 Commander portfolio exists because passing the exam is not enough to demonstrate strategic leadership in AI evaluation. The exam tests knowledge. The portfolio tests judgment, execution, and impact.
You cannot reach Commander level through study alone. You must demonstrate that you can:
- Design evaluation programs that solve real problems (Artifact 1)
- Contribute to the field and advance industry thinking (Artifact 2)
- Develop other evaluators and spread expertise (Artifact 3)
The portfolio is evaluated by a panel of 3-5 expert evaluators, not just the automated exam system. They look for evidence of strategic thinking, not just technical competence.
The portfolio is harder than the exam. This is intentional. The Commander credential should be rare and valuable.
The 3 Mandatory Artifacts
You must submit all three artifacts. You cannot skip one or "compensate" with a stronger version of another. The portfolio is holistic.
| Artifact | Scope | Format | Target Length | Weight |
|---|---|---|---|---|
| Eval Program Design | Design for a real evaluation initiative | Document (PDF or markdown) | 15-25 pages | 35% |
| Published Contribution | Public intellectual contribution | Blog, paper, talk, or tool | Varies | 35% |
| Mentorship Evidence | Documentation of mentoring 1-2 evaluators | Structured documentation | 10-15 pages + artifacts | 30% |
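The weights in the table determine how artifact scores combine into an overall result. The exact aggregation formula isn't published, so treat this sketch as an illustration that assumes a simple weighted average on the 0-100 scale:

```python
# Hypothetical sketch: combining artifact scores with the stated weights.
# Assumes a simple weighted average; the real aggregation formula is not
# specified in this guide.

WEIGHTS = {
    "eval_program_design": 0.35,
    "published_contribution": 0.35,
    "mentorship_evidence": 0.30,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted average of artifact scores (each on a 0-100 scale)."""
    return sum(scores[name] * w for name, w in WEIGHTS.items())

# Example: strong design, solid publication, adequate mentorship.
example = {
    "eval_program_design": 88,
    "published_contribution": 82,
    "mentorship_evidence": 76,
}
print(round(overall_score(example), 1))  # prints 82.3
```

Note how the 35/35/30 split means no single artifact can carry the portfolio: even a perfect score on one artifact contributes at most 35 points toward the bar.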
Artifact 1: Eval Program Design (35% weight)
This is a comprehensive design document for an evaluation program. Not a proposal; a design. The program should be:
- Real: Either implemented or implementable. Not hypothetical.
- Significant: Addressing a material business or technical problem.
- Your design: You led it, not just contributed to it.
Required Sections
1. Problem Statement (2-3 pages)
What problem does this evaluation solve? Examples:
- "Our recommendation engine has unknown performance across user segments. We ship features without knowing if they help or hurt different cohorts."
- "Our medical AI system is deployed but we have no systematic safety monitoring. A failure could harm patients."
- "Our code generation model powers 10K developers. We don't know if it's getting worse (model drift). This program establishes continuous monitoring."
The problem statement should feel urgent. Why does this matter now? What's the risk of inaction?
2. Evaluation Architecture (3-4 pages)
High-level design of your evaluation approach:
- What systems are being evaluated? (specific models, not vague)
- What are the key metrics? (at least 5-7, with justification for each)
- Who evaluates? (humans, AI judges, automated)
- What's the evaluation frequency? (continuous, monthly, quarterly)
- How do you handle edge cases? (long-tail scenarios, rare events)
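Alongside the prose, it can help reviewers to capture these architecture decisions in a compact, machine-readable spec. A minimal sketch; every name and value below is a hypothetical example, not a required format:

```python
# Illustrative only: one way to record the architecture decisions above.
# All system names, metrics, and values are hypothetical examples.
from dataclasses import dataclass, field

@dataclass
class MetricSpec:
    name: str
    justification: str  # why this metric matters for the problem
    evaluator: str      # "human", "ai_judge", or "automated"

@dataclass
class EvalArchitecture:
    systems_under_eval: list[str]
    frequency: str      # "continuous", "monthly", or "quarterly"
    metrics: list[MetricSpec] = field(default_factory=list)
    edge_case_strategy: str = ""

spec = EvalArchitecture(
    systems_under_eval=["ranker-v3 (production recommendation model)"],
    frequency="continuous",
    metrics=[
        MetricSpec("ndcg@10", "core ranking quality", "automated"),
        MetricSpec("segment_parity", "gap across user cohorts", "automated"),
        MetricSpec("judged_relevance", "spot-check automated scores", "human"),
    ],  # a real design should list 5-7 metrics, each justified
    edge_case_strategy="weekly sample of long-tail queries, human-reviewed",
)
```

A spec like this also makes the design auditable: a reviewer can check at a glance that every metric names its evaluator and carries a justification.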
3. Governance and Ownership (2 pages)
Who is responsible for what? Examples:
- Product team owns evaluation budget
- ML Engineering runs the evaluation pipeline
- Ethics committee reviews safety evaluation results
- Executive steering committee sets go/no-go for deployment based on eval results
Include a RACI matrix (Responsible, Accountable, Consulted, Informed) for key decisions.
4. Rollout Timeline (2-3 pages)
Month-by-month or phase-by-phase plan. Example:
- Weeks 1-4: Assemble team, define metrics, build evaluation dataset
- Weeks 5-8: Run baseline evaluation on current system
- Weeks 9-12: Implement automated evaluation pipeline
- Weeks 13-16: Deploy to production, integrate with CI/CD
- Month 5+: Monitor, iterate, expand to related systems
5. Budget and Resource Requirements (2 pages)
Be specific:
- Staffing: 1 FTE evaluator, 0.5 FTE engineer for tooling
- Tools: $50K/year for LangSmith + human annotation platform
- Data: $20K for creating evaluation dataset
- Training: $10K for team training
- Total: $150K year 1, $80K year 2+
Frame it as an investment, with expected ROI.
6. Success Criteria and Measurement (2 pages)
How will you know the evaluation program succeeded? Examples:
- Process metrics: Evaluation runs automatically every week; 95%+ success rate
- Business metrics: 40% reduction in post-deployment issues; 2-week faster feature delivery
- Quality metrics: Evaluators achieve >85% inter-rater reliability
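For the inter-rater reliability target, name the statistic you'll use. A chance-corrected measure such as Cohen's kappa is a common choice (this guide doesn't mandate one); a minimal two-rater sketch in pure Python:

```python
# Illustrative sketch of Cohen's kappa for two raters labeling the same items.
# Assumes inter-rater reliability is measured as chance-corrected agreement;
# your annotation platform may report a different statistic.
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random according to
    # their own marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

a = ["pass", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 2))  # prints 0.67
```

Note that kappa is stricter than raw percent agreement: the two raters above agree on 5 of 6 items (83%) but score only 0.67 after correcting for chance, so state explicitly which number your >85% threshold refers to.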
Quality Criteria for Artifact 1
| Criterion | Acceptable (70+) | Strong (85+) | Exceptional (95+) |
|---|---|---|---|
| Problem clarity | Problem is clear; urgency is implied | Problem is clear and urgent; quantified | Problem is quantified with stakeholder validation |
| Architecture feasibility | Design is reasonable; some gaps exist | Design is detailed and implementable | Design is implementable with evidence of piloting |
| Metric selection | Metrics are chosen; some lack justification | All metrics are justified; align with problem | Metrics are justified, weighted, and validated |
| Governance clarity | Ownership is defined but vague | Clear ownership with decision rights | RACI matrix; escalation paths defined |
| Timeline realism | Timeline is provided but aggressive | Timeline is realistic with milestones | Timeline based on actual implementation experience |
Artifact 2: Published Contribution (35% weight)
You must make a public intellectual contribution to the field of AI evaluation. This is not internal documentation; it's external, for the community.
What Counts as Published Contribution
- Technical blog post: 3,000+ words on eval.qa blog, Medium, or company blog (publicly accessible)
- Open-source tool: GitHub repo with 500+ stars, solving a real eval problem
- Academic paper: Published in a peer-reviewed venue (e.g., ACL, NeurIPS workshops, FAccT)
- Conference talk: 30+ minute technical talk at a major conference (NeurIPS, ICML, ICLR, Anthropic LLM Day, etc.)
- Research methodology: Documented framework/methodology that others can apply
- Benchmark contribution: Created or significantly improved a public evaluation benchmark
Quality Bar: What Doesn't Count
- Internal-only documentation (not published)
- Company blog post that's just marketing (not technical depth)
- Open-source tool with 50 stars and no usage (not adopted)
- Conference talk on a basic topic without novel insight
- Anything that's a rehash of existing work without new contribution
Examples of Strong Contributions
Blog Post Example: "Multi-Dimensional Evaluation of RAG Systems: Beyond BLEU Scores" — 4,500 words, published on eval.qa blog, 5,000+ reads, cited by others
Tool Example: "EvalMetrics" — Open-source Python library for domain-specific evaluation metrics, 800 GitHub stars, 50 organizations using it
Paper Example: "Causal Inference in AI Evaluation: Separating Correlation from Impact" — Published at FAccT 2025, 20+ citations
Talk Example: "Building Evaluation Culture: A 500-Person Case Study" — 35-minute technical talk at NeurIPS 2025, 200+ attendees
Minimum Thresholds
- Blog: 3,000+ words, 1,000+ views, 6+ months publicly available
- Tool: 500+ GitHub stars, documented, used by 10+ organizations, active maintenance
- Paper: Published (preprint alone is insufficient), peer-reviewed venue
- Talk: 30+ minutes, major conference, recorded and publicly available
If you're borderline (e.g., blog with 2,000 words), you may be asked to strengthen it before final acceptance.
Artifact 3: Mentorship Evidence (30% weight)
Strategic leadership means developing others. This artifact documents your mentorship of 1-2 evaluators.
What Mentorship Means
Not giving a presentation or writing a tutorial. Actual 1-on-1 development of another evaluator, where they:
- Progress from one level to the next (e.g., L2 to L3)
- Can design and run evaluations independently after mentorship
- Cite you and acknowledge your guidance
Required Documentation
For each mentee, provide:
- Initial assessment: Where they started (skill level, role, background)
- Learning plan: What you committed to teach them (6-month roadmap)
- Session log: Month-by-month summary of sessions:
  - Topics covered
  - Challenges they faced
  - Progress made
- Evidence of growth:
  - Evaluation they designed independently
  - Blog post they wrote
  - Certification they earned
  - Project they led
- Mentee testimonial: How they benefited (1-2 paragraphs in their own words)
- Mentee letter of recommendation: Optional but powerful
Quality Examples
Strong mentorship evidence:
I mentored Sarah, a junior ML engineer with 2 years of experience. She had zero evaluation background. Over 6 months, I guided her through:
- Month 1: Basics of eval, human evaluation design, annotation rubrics
- Month 2: Designing eval for her team's recommendation system
- Month 3: Running first evaluation, interpreting results
- Month 4: Automation, building eval pipeline
- Months 5-6: Leading eval for a new project, mentoring a junior intern herself
She now leads evaluation for 2 ML systems independently. She passed the L2 evaluation exam and is a core member of our eval center of excellence.
Weak mentorship evidence:
I mentored 5 people in evaluation. I gave them access to my notes and recommended they take the eval.qa courses. They were interested in learning about evaluation metrics.
The second example is not mentorship; it's resource sharing. Mentorship is intentional, structured, and demonstrates growth.
Portfolio Evaluation Rubric
Your portfolio is graded on a 0-100 scale across 5 dimensions. You need 75+ overall to pass.
| Dimension | Excellent (95-100) | Strong (85-94) | Adequate (75-84) | Weak (<75) |
|---|---|---|---|---|
| Strategic Thinking | Design solves a critical problem; shows sophisticated understanding of org context | Design solves a real problem; good understanding of constraints | Design solves a defined problem; some gaps in sophistication | Problem is vague; limited strategic insight |
| Execution Quality | Evidence of successful execution; metrics show documented impact | Design is detailed and implementable; some execution evidence | Design is clear; limited execution evidence | Design lacks clarity or feasibility |
| Field Contribution | Published work that's novel and widely adopted; thousands cite or use it | Published work that's solid; adopted by meaningful audience | Published work that's competent; limited adoption | Work is not published or published but barely adopted |
| Leadership & Mentorship | Mentees have progressed multiple levels; are now mentoring others | Mentees have progressed 1-2 levels; are independent evaluators | Mentees have gained skills; limited independence demonstrated | Mentorship is not structured or evidence is weak |
| Communication | All artifacts are exceptionally well-written; ideas are crystalline | Artifacts are well-written; ideas are clear | Artifacts are readable; some clarity issues | Artifacts are hard to follow; unclear writing |
Timeline and Submission Process
Submission Windows
Portfolios are reviewed in three windows per year:
- Window 1: January-February (deadline Jan 31)
- Window 2: May-June (deadline May 31)
- Window 3: September-October (deadline Sept 30)
Panels review and respond within 4 weeks of the deadline. You'll receive one of three decisions: Accepted, Revision Required, or Rejected.
Format Requirements
- Eval Program Design: PDF or markdown, max 30 pages
- Published Contribution: Link to public URL (blog, GitHub, paper, etc.) + 2-page summary
- Mentorship Evidence: PDF with structure as described above
- Cover letter: 1 page describing your portfolio
- Anonymization form: Indicates any proprietary information that needs redaction (see below)
All files must be submitted via the eval.qa portal. Max file size: 100MB total.
Common Rejection Reasons (and How to Avoid Them)
Top Rejection Reasons
1. "Problem Statement Is Too Vague" (12% of rejections)
Weak: "We need better evaluation of our AI models."
Strong: "Our recommendation engine serves 50M users across 3 geographies. We found a 40% performance gap between desktop and mobile users, but only after we deployed a new ranking algorithm. We need systematic evaluation to catch such gaps before deployment."
Fix: Be specific. Use numbers. Describe the failure mode.
2. "Design Is Proposed, Not Implemented" (14%)
Weak: "We proposed this evaluation framework, but it was never implemented."

Strong: "We designed and implemented this evaluation program. It's now running in production."
Fix: If it's not implemented, provide a detailed plan for implementation and pilot it yourself if possible. Or choose a different project that you've actually executed.
3. "Contribution Is Too Niche or Too Basic" (18%)
Too niche: "I wrote a blog post on evaluating small language models for 3-word summaries." (Audience: ~10 people)
Too basic: "I wrote a blog post explaining what BLEU scores are." (Competent but not novel)
Strong: "I developed a new methodology for counterfactual evaluation of recommendation systems, published at FAccT, now used by 15+ companies."
Fix: Choose a contribution that's both novel and relevant to the broader community. Aim for 1000+ potential readers/users.
4. "Mentorship Is Not Structured" (16%)
Weak: "I helped several people learn about evaluation. They found it helpful."
Strong: "I mentored Alice from L1 to L3 over 6 months. We had bi-weekly sessions. She now leads evaluation for 2 products and mentors others."
Fix: Focus on 1-2 mentees. Document everything. Get their written feedback. Show concrete outcomes.
5. "Unclear Writing; Hard to Follow" (11%)
Reviewers are expert evaluators, but they shouldn't need to hunt for meaning in your writing.
Fix: Have someone outside your field review it. Are ideas clear? Do claims have evidence? Is the structure logical?
Adequate vs. Exceptional Portfolios
Passing is 75+. But portfolios that score 90+ are memorable and set you apart.
An Adequate Portfolio (75-84)
- Eval program design is well-documented and implementable, but fairly standard in approach
- Published contribution is solid but has limited adoption or novelty
- Mentorship is structured and shows mentee growth, but growth is incremental
- Writing is clear but could be more compelling
You pass. You earn the credential. You can lead evaluation programs. But you're not making headlines.
An Exceptional Portfolio (90-100)
- Eval program design is sophisticated, addressing a complex problem with novel metrics or methodology
- Published contribution is widely adopted (1000+ adopters/readers), cited by others, or influential
- Mentees have become leaders themselves; mentorship is reproducible and scalable
- Writing is exceptional; ideas flow naturally; evidence is compelling
You pass with distinction. You're considered a thought leader. You get recruited for advisor roles, speaking invitations, consulting opportunities.
6-Week Preparation Roadmap
If you're starting from scratch, here's a realistic timeline:
Week 1: Artifact Selection
- Identify a strong eval program design you've led
- Identify a publication opportunity or existing published work
- Identify 1-2 people you've mentored
- If missing any, plan how to create it (this may take longer than 6 weeks)
Weeks 2-3: Eval Program Design
- Write first draft (15-20 pages)
- Get feedback from a peer
- Revise
Week 4: Published Contribution
- If you have existing work, write a 2-page summary highlighting impact
- If you don't, accelerate your blog post or paper (this often extends timeline)
Week 5: Mentorship Evidence
- Compile session logs and mentee progress
- Write narrative (10-15 pages)
- Collect mentee testimonials
Week 6: Polish & Submit
- Proofread all artifacts
- Write cover letter
- Get review from mentor or peer
- Submit
Reality check: If you don't have published work or structured mentorship yet, add 3-6 months to this timeline. Don't rush creating artifacts; quality matters more than speed.
What "Industry-Level" Contribution Means
For the published contribution, the bar is "industry-level." What does this mean exactly?
Not Industry-Level
- Blog post on your company's internal blog, not public
- Tool used only within your org
- Tutorial on evaluation (how-to, not novel)
- Preprint posted to arXiv but not published
- Conference talk that's a case study (valuable but not novel methodology)
Industry-Level
- Published blog post on a major public platform (e.g., the eval.qa blog or Medium) with 1,000+ reads
- Open-source tool with 500+ GitHub stars and active community
- Peer-reviewed paper published at major venue
- Keynote or 30+ minute technical talk at major conference
- Research framework or methodology that others adopt and build on
The key test: Would someone outside your organization find this valuable? If yes, it's industry-level.
Anonymization and Confidentiality
Some portfolios contain proprietary information. You can anonymize while keeping substance.
What You Can Anonymize
- Company name ("Company X," "FinServ Org," etc.)
- Specific revenue or user numbers (say "millions of users" instead of "47.3M")
- Proprietary metrics (describe the concept, not the exact calculation)
- Names of team members
- Specific model architectures or training data
What You Cannot Anonymize (or it weakens the submission)
- Industry or domain ("healthcare," "fintech," etc. — this is important context)
- Problem scale (saying "small problem" vs. "affects millions of users" matters)
- Impact metrics (if you can't share the numbers, describe the direction at least)
- The fact that the program was executed (don't claim implementation if proprietary details prevent proof)
Fill out the anonymization form; the review panel will treat the flagged material as confidential. You won't be penalized for reasonable anonymization.
The Revision Process
If you receive "Revision Required," you're not rejected. You have a clear path to pass.
The panel will specify what needs strengthening:
- Major revisions: Typically mentorship evidence or problem framing. Requires 4-6 weeks to address.
- Minor revisions: Typically clarity or additional examples. Requires 1-2 weeks.
You have 12 weeks to resubmit revised artifacts. The same panel reviews your revision.
Revision Success Rate
Revision is not a death sentence. Most who revise successfully pass.
FAQ: Common Questions
Q: Can I use internal work for my eval program design if it's been published externally (e.g., as a case study)?
A: Yes. As long as you have permission and anonymize proprietary information appropriately, you can use real work.
Q: Can I use the same published contribution for another credential (like a portfolio for an academic PhD or teaching role)?
A: Yes. A piece of work can serve multiple purposes. You don't need separate publications.
Q: What if I co-authored my published work? Does that disqualify me?
A: No. Co-authorship is fine. Specify your contribution (1st author, 2nd author, equal contribution, etc.). You should be able to speak to your specific role.
Q: Can I use a product I built that's not open-source as my published contribution?
A: It depends on impact. If it's closed-source, the bar is higher: documented case studies, user testimonials, clear evidence of adoption. Open-source projects get credit for transparency and community.
Q: If I mentor someone who fails their certification exam, does that hurt my portfolio?
A: No. Your job is to develop them; their job is to pass the exam. Mentee growth is measured by progression, not exam results.
Q: How do I find someone to mentor if I don't already have mentees?
A: You have options: (1) Mentor an existing junior evaluator at your org. (2) Volunteer as a mentor through eval.qa's mentorship program. (3) Find someone in your professional network interested in learning evaluation. You have time to find them; don't wait.
Q: Can I mentor someone remotely?
A: Yes. Document your sessions (Zoom, Slack, email). Remote mentorship is just as valid as in-person.
Q: Is 6 weeks enough to mentor someone if I'm just starting?
A: No. You should have 6+ months of mentoring history by the time you submit. Plan for this when you're starting your eval journey.
Q: Can my portfolio evaluation program be the same as my day job?
A: Yes. If you designed and led an evaluation program at work, that's valid. You don't need side projects.
Q: If my eval program design was rejected by my organization for implementation, can I still use it in my portfolio?
A: You can, but frame it as a "design proposal" not "executed program." The panel will ask why it wasn't implemented. Be honest. If it was rejected for budget reasons, that's understandable. If it was rejected because the design was flawed, use something else.
Q: How much detail should I include if I'm anonymizing?
A: Enough detail that an expert evaluator can understand the sophistication of your approach. If anonymization makes it too vague, you've anonymized too much.
Q: Can I update my portfolio after submission (before the panel reviews it)?
A: You have 5 days after submission to make minor updates (clarifications, broken links, etc.). After that, it's locked for review.
Q: What if I think a reviewer was unfair in their assessment?
A: You can appeal. You'll get a second review by a different panel member. Appeals are rare but possible if you can point to specific errors in assessment.
