What the Portfolio Is and Why It Exists
The L5 Commander portfolio exists because passing the exam is not enough to demonstrate strategic leadership in AI evaluation. The exam tests knowledge. The portfolio tests judgment, execution, and impact.
You cannot reach Commander level through study alone. You must demonstrate that you can:
- Design evaluation programs that solve real problems (Artifact 1)
- Contribute to the field and advance industry thinking (Artifact 2)
- Develop other evaluators and spread expertise (Artifact 3)
The portfolio is evaluated by a panel of 3-5 expert evaluators, not just the automated exam system. They look for evidence of strategic thinking, not just technical competence.
The portfolio is harder than the exam. This is intentional. The Commander credential should be rare and valuable.
The 3 Mandatory Artifacts
You must submit all three artifacts. You cannot skip one or "compensate" with a stronger version of another. The portfolio is holistic.
| Artifact | Scope | Format | Target Length | Weight |
|---|---|---|---|---|
| Eval Program Design | Design for a real evaluation initiative | Document (PDF or markdown) | 15-25 pages | 35% |
| Published Contribution | Public intellectual contribution | Blog, paper, talk, or tool | Varies | 35% |
| Mentorship Evidence | Documentation of mentoring 1-2 evaluators | Structured documentation | 10-15 pages + artifacts | 30% |
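The weights in the table determine how artifact scores combine into an overall result. The exact aggregation formula isn't published, so treat this sketch as an illustration that assumes a simple weighted average on the 0-100 scale:

```python
# Hypothetical sketch: combining artifact scores with the stated weights.
# Assumes a simple weighted average; the real aggregation formula is not
# specified in this guide.

WEIGHTS = {
    "eval_program_design": 0.35,
    "published_contribution": 0.35,
    "mentorship_evidence": 0.30,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted average of artifact scores (each on a 0-100 scale)."""
    return sum(scores[name] * w for name, w in WEIGHTS.items())

# Example: strong design, solid publication, adequate mentorship.
example = {
    "eval_program_design": 88,
    "published_contribution": 82,
    "mentorship_evidence": 76,
}
print(round(overall_score(example), 1))  # prints 82.3
```

Note how the 35/35/30 split means no single artifact can carry the portfolio: even a perfect score on one artifact contributes at most 35 points toward the bar.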
Artifact 1: Eval Program Design (35% weight)
This is a comprehensive design document for an evaluation program. Not a proposal; a design. The program should be:
- Real: Either implemented or implementable. Not hypothetical.
- Significant: Addressing a material business or technical problem.
- Your design: You led it, not just contributed to it.
Required Sections
1. Problem Statement (2-3 pages)
What problem does this evaluation solve? Examples:
- "Our recommendation engine has unknown performance across user segments. We ship features without knowing if they help or hurt different cohorts."
- "Our medical AI system is deployed but we have no systematic safety monitoring. A failure could harm patients."
- "Our code generation model powers 10K developers. We don't know if it's getting worse (model drift). This program establishes continuous monitoring."
The problem statement should feel urgent. Why does this matter now? What's the risk of inaction?
2. Evaluation Architecture (3-4 pages)
High-level design of your evaluation approach:
- What systems are being evaluated? (specific models, not vague)
- What are the key metrics? (at least 5-7, with justification for each)
- Who evaluates? (humans, AI judges, automated)
- What's the evaluation frequency? (continuous, monthly, quarterly)
- How do you handle edge cases? (long-tail scenarios, rare events)
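Alongside the prose, it can help reviewers to capture these architecture decisions in a compact, machine-readable spec. A minimal sketch; every name and value below is a hypothetical example, not a required format:

```python
# Illustrative only: one way to record the architecture decisions above.
# All system names, metrics, and values are hypothetical examples.
from dataclasses import dataclass, field

@dataclass
class MetricSpec:
    name: str
    justification: str  # why this metric matters for the problem
    evaluator: str      # "human", "ai_judge", or "automated"

@dataclass
class EvalArchitecture:
    systems_under_eval: list[str]
    frequency: str      # "continuous", "monthly", or "quarterly"
    metrics: list[MetricSpec] = field(default_factory=list)
    edge_case_strategy: str = ""

spec = EvalArchitecture(
    systems_under_eval=["ranker-v3 (production recommendation model)"],
    frequency="continuous",
    metrics=[
        MetricSpec("ndcg@10", "core ranking quality", "automated"),
        MetricSpec("segment_parity", "gap across user cohorts", "automated"),
        MetricSpec("judged_relevance", "spot-check automated scores", "human"),
    ],  # a real design should list 5-7 metrics, each justified
    edge_case_strategy="weekly sample of long-tail queries, human-reviewed",
)
```

A spec like this also makes the design auditable: a reviewer can check at a glance that every metric names its evaluator and carries a justification.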
3. Governance and Ownership (2 pages)
Who is responsible for what? Examples:
- Product team owns evaluation budget
- ML Engineering runs the evaluation pipeline
- Ethics committee reviews safety evaluation results
- Executive steering committee sets go/no-go for deployment based on eval results
Include a RACI matrix (Responsible, Accountable, Consulted, Informed) for key decisions.
4. Rollout Timeline (2-3 pages)
Month-by-month or phase-by-phase plan. Example:
- Weeks 1-4: Assemble team, define metrics, build evaluation dataset
- Weeks 5-8: Run baseline evaluation on current system
- Weeks 9-12: Implement automated evaluation pipeline
- Weeks 13-16: Deploy to production, integrate with CI/CD
- Month 5+: Monitor, iterate, expand to related systems
5. Budget and Resource Requirements (2 pages)
Be specific:
- Staffing: 1 FTE evaluator, 0.5 FTE engineer for tooling
- Tools: $50K/year for LangSmith + human annotation platform
- Data: $20K for creating evaluation dataset
- Training: $10K for team training
- Total: $150K year 1, $80K year 2+
Frame it as an investment, with expected ROI.
6. Success Criteria and Measurement (2 pages)
How will you know the evaluation program succeeded? Examples:
- Process metrics: Evaluation runs automatically every week; 95%+ success rate
- Business metrics: 40% reduction in post-deployment issues; 2-week faster feature delivery
- Quality metrics: Evaluators achieve >85% inter-rater reliability
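For the inter-rater reliability target, name the statistic you'll use. A chance-corrected measure such as Cohen's kappa is a common choice (this guide doesn't mandate one); a minimal two-rater sketch in pure Python:

```python
# Illustrative sketch of Cohen's kappa for two raters labeling the same items.
# Assumes inter-rater reliability is measured as chance-corrected agreement;
# your annotation platform may report a different statistic.
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random according to
    # their own marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

a = ["pass", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 2))  # prints 0.67
```

Note that kappa is stricter than raw percent agreement: the two raters above agree on 5 of 6 items (83%) but score only 0.67 after correcting for chance, so state explicitly which number your >85% threshold refers to.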
Quality Criteria for Artifact 1
| Criterion | Acceptable (70+) | Strong (85+) | Exceptional (95+) |
|---|---|---|---|
| Problem clarity | Problem is clear; urgency is implied | Problem is clear and urgent; quantified | Problem is quantified with stakeholder validation |
| Architecture feasibility | Design is reasonable; some gaps exist | Design is detailed and implementable | Design is implementable with evidence of piloting |
| Metric selection | Metrics are chosen; some lack justification | All metrics are justified; align with problem | Metrics are justified, weighted, and validated |
| Governance clarity | Ownership is defined but vague | Clear ownership with decision rights | RACI matrix; escalation paths defined |
| Timeline realism | Timeline is provided but aggressive | Timeline is realistic with milestones | Timeline based on actual implementation experience |
Artifact 2: Published Contribution (35% weight)
You must make a public intellectual contribution to the field of AI evaluation. This is not internal documentation; it's external, for the community.
What Counts as Published Contribution
- Technical blog post: 3,000+ words on eval.qa blog, Medium, or company blog (publicly accessible)
- Open-source tool: GitHub repo with 500+ stars, solving a real eval problem
- Academic paper: Published in a peer-reviewed venue (e.g., ACL, NeurIPS workshops, FAccT)
- Conference talk: 30+ minute technical talk at a major conference (NeurIPS, ICML, ICLR, Anthropic LLM Day, etc.)
- Research methodology: Documented framework/methodology that others can apply
- Benchmark contribution: Created or significantly improved a public evaluation benchmark
Quality Bar: What Doesn't Count
- Internal-only documentation (not published)
- Company blog post that's just marketing (not technical depth)
- Open-source tool with 50 stars and no usage (not adopted)
- Conference talk on a basic topic without novel insight
- Anything that's a rehash of existing work without new contribution
Examples of Strong Contributions
Blog Post Example: "Multi-Dimensional Evaluation of RAG Systems: Beyond BLEU Scores" — 4,500 words, published on eval.qa blog, 5,000+ reads, cited by others
Tool Example: "EvalMetrics" — Open-source Python library for domain-specific evaluation metrics, 800 GitHub stars, 50 organizations using it
Paper Example: "Causal Inference in AI Evaluation: Separating Correlation from Impact" — Published at FAccT 2025, 20+ citations
Talk Example: "Building Evaluation Culture: A 500-Person Case Study" — 35-minute technical talk at NeurIPS 2025, 200+ attendees
Minimum Thresholds
- Blog: 3,000+ words, 1,000+ views, 6+ months publicly available
- Tool: 500+ GitHub stars, documented, used by 10+ organizations, active maintenance
- Paper: Published (preprint alone is insufficient), peer-reviewed venue
- Talk: 30+ minutes, major conference, recorded and publicly available
If you're borderline (e.g., blog with 2,000 words), you may be asked to strengthen it before final acceptance.
Artifact 3: Mentorship Evidence (30% weight)
Strategic leadership means developing others. This artifact documents your mentorship of 1-2 evaluators.
What Mentorship Means
Not giving a presentation or writing a tutorial. Actual 1-on-1 development of another evaluator, where they:
- Progress from one level to the next (e.g., L2 to L3)
- Can design and run evaluations independently after mentorship
- Cite you and acknowledge your guidance
Required Documentation
For each mentee, provide:
- Initial assessment: Where they started (skill level, role, background)
- Learning plan: What you committed to teach them (6-month roadmap)
- Session log: Month-by-month summary of sessions:
  - Topics covered
  - Challenges they faced
  - Progress made
- Evidence of growth:
  - Evaluation they designed independently
  - Blog post they wrote
  - Certification they earned
  - Project they led
- Mentee testimonial: How they benefited (1-2 paragraphs in their own words)
- Mentee letter of recommendation: Optional but powerful
Quality Examples
Strong mentorship evidence:
I mentored Sarah, a junior ML engineer with 2 years of experience. She had zero evaluation background. Over 6 months, I guided her through:
- Month 1: Basics of eval, human evaluation design, annotation rubrics
- Month 2: Designing eval for her team's recommendation system
- Month 3: Running first evaluation, interpreting results
- Month 4: Automation, building eval pipeline
- Months 5-6: Leading eval for a new project, mentoring a junior intern herself
She now leads evaluation for 2 ML systems independently. She passed the L2 evaluation exam and is a core member of our eval center of excellence.
Weak mentorship evidence:
I mentored 5 people in evaluation. I gave them access to my notes and recommended they take the eval.qa courses. They were interested in learning about evaluation metrics.
The second example is not mentorship; it's resource sharing. Mentorship is intentional, structured, and demonstrates growth.
Portfolio Evaluation Rubric
Your portfolio is graded on a 0-100 scale across 5 dimensions. You need 75+ overall to pass.
| Dimension | Excellent (95-100) | Strong (85-94) | Adequate (75-84) | Weak (<75) |
|---|---|---|---|---|
| Strategic Thinking | Design solves a critical problem; shows sophisticated understanding of org context | Design solves a real problem; good understanding of constraints | Design solves a defined problem; some gaps in sophistication | Problem is vague; limited strategic insight |
| Execution Quality | Evidence of successful execution; metrics show documented impact | Design is detailed and implementable; some execution evidence | Design is clear; limited execution evidence | Design lacks clarity or feasibility |
| Field Contribution | Published work that's novel and widely adopted; thousands cite or use it | Published work that's solid; adopted by meaningful audience | Published work that's competent; limited adoption | Work is not published or published but barely adopted |
| Leadership & Mentorship | Mentees have progressed multiple levels; are now mentoring others | Mentees have progressed 1-2 levels; are independent evaluators | Mentees have gained skills; limited independence demonstrated | Mentorship is not structured or evidence is weak |
| Communication | All artifacts are exceptionally well-written; ideas are crystalline | Artifacts are well-written; ideas are clear | Artifacts are readable; some clarity issues | Artifacts are hard to follow; unclear writing |
Timeline and Submission Process
Submission Windows
Portfolios are reviewed in three windows per year:
- Window 1: January-February (deadline Jan 31)
- Window 2: May-June (deadline May 31)
- Window 3: September-October (deadline Sept 30)
Panels review and respond within 4 weeks of the deadline. You'll receive one of three decisions: Accepted, Revision Required, or Rejected.
Format Requirements
- Eval Program Design: PDF or markdown, max 30 pages
- Published Contribution: Link to public URL (blog, GitHub, paper, etc.) + 2-page summary
- Mentorship Evidence: PDF with structure as described above
- Cover letter: 1 page describing your portfolio
- Anonymization form: Indicates any proprietary information that needs redaction (see below)
All files must be submitted via the eval.qa portal. Max file size: 100MB total.
Common Rejection Reasons (and How to Avoid Them)
Top Rejection Reasons
1. "Problem Statement Is Too Vague" (12% of rejections)
Weak: "We need better evaluation of our AI models."
Strong: "Our recommendation engine serves 50M users across 3 geographies. We found a 40% performance gap between desktop and mobile users, but only after we deployed a new ranking algorithm. We need systematic evaluation to catch such gaps before deployment."
Fix: Be specific. Use numbers. Describe the failure mode.
2. "Design Is Proposed, Not Implemented" (14%)
Weak: "We proposed this evaluation framework, but it was never implemented."

Strong: "We designed and implemented this evaluation program. It's now running in production."
Fix: If it's not implemented, provide a detailed plan for implementation and pilot it yourself if possible. Or choose a different project that you've actually executed.
3. "Contribution Is Too Niche or Too Basic" (18%)
Too niche: "I wrote a blog post on evaluating small language models for 3-word summaries." (Audience: ~10 people)
Too basic: "I wrote a blog post explaining what BLEU scores are." (Competent but not novel)
Strong: "I developed a new methodology for counterfactual evaluation of recommendation systems, published at FAccT, now used by 15+ companies."
Fix: Choose a contribution that's both novel and relevant to the broader community. Aim for 1000+ potential readers/users.
4. "Mentorship Is Not Structured" (16%)
Weak: "I helped several people learn about evaluation. They found it helpful."
Strong: "I mentored Alice from L1 to L3 over 6 months. We had bi-weekly sessions. She now leads evaluation for 2 products and mentors others."
Fix: Focus on 1-2 mentees. Document everything. Get their written feedback. Show concrete outcomes.
5. "Unclear Writing; Hard to Follow" (11%)
Reviewers are expert evaluators, but they shouldn't need to hunt for meaning in your writing.
Fix: Have someone outside your field review it. Are ideas clear? Do claims have evidence? Is the structure logical?
Adequate vs. Exceptional Portfolios
Passing is 75+. But portfolios that score 90+ are memorable and set you apart.
An Adequate Portfolio (75-84)
- Eval program design is well-documented and implementable, but fairly standard in approach
- Published contribution is solid but has limited adoption or novelty
- Mentorship is structured and shows mentee growth, but growth is incremental
- Writing is clear but could be more compelling
You pass. You earn the credential. You can lead evaluation programs. But you're not making headlines.
An Exceptional Portfolio (90-100)
- Eval program design is sophisticated, addressing a complex problem with novel metrics or methodology
- Published contribution is widely adopted (1000+ adopters/readers), cited by others, or influential
- Mentees have become leaders themselves; mentorship is reproducible and scalable
- Writing is exceptional; ideas flow naturally; evidence is compelling
You pass with distinction. You're considered a thought leader. You get recruited for advisor roles, speaking invitations, consulting opportunities.
6-Week Preparation Roadmap
If you're starting from scratch, here's a realistic timeline:
Week 1: Artifact Selection
- Identify a strong eval program design you've led
- Identify a publication opportunity or existing published work
- Identify 1-2 people you've mentored
- If missing any, plan how to create it (this may take longer than 6 weeks)
Weeks 2-3: Eval Program Design
- Write first draft (15-20 pages)
- Get feedback from a peer
- Revise
Week 4: Published Contribution
- If you have existing work, write a 2-page summary highlighting impact
- If you don't, accelerate your blog post or paper (this often extends timeline)
Week 5: Mentorship Evidence
- Compile session logs and mentee progress
- Write narrative (10-15 pages)
- Collect mentee testimonials
Week 6: Polish & Submit
- Proofread all artifacts
- Write cover letter
- Get review from mentor or peer
- Submit
Reality check: If you don't have published work or structured mentorship yet, add 3-6 months to this timeline. Don't rush creating artifacts; quality matters more than speed.
What "Industry-Level" Contribution Means
For the published contribution, the bar is "industry-level." What does this mean exactly?
Not Industry-Level
- Blog post on your company's internal blog, not public
- Tool used only within your org
- Tutorial on evaluation (how-to, not novel)
- Preprint posted to arXiv but not published
- Conference talk that's a case study (valuable but not novel methodology)
Industry-Level
- Published blog post on a major public platform (e.g., the eval.qa blog or Medium) with 1,000+ reads
- Open-source tool with 500+ GitHub stars and active community
- Peer-reviewed paper published at major venue
- Keynote or 30+ minute technical talk at major conference
- Research framework or methodology that others adopt and build on
The key test: Would someone outside your organization find this valuable? If yes, it's industry-level.
Anonymization and Confidentiality
Some portfolios contain proprietary information. You can anonymize while keeping substance.
What You Can Anonymize
- Company name ("Company X," "FinServ Org," etc.)
- Specific revenue or user numbers (say "millions of users" instead of "47.3M")
- Proprietary metrics (describe the concept, not the exact calculation)
- Names of team members
- Specific model architectures or training data
What You Cannot Anonymize (or it weakens the submission)
- Industry or domain ("healthcare," "fintech," etc. — this is important context)
- Problem scale (saying "small problem" vs. "affects millions of users" matters)
- Impact metrics (if you can't share the numbers, describe the direction at least)
- The fact that the program was executed (don't claim implementation if proprietary details prevent proof)
Fill out the anonymization form; the review panel will treat the flagged material as confidential. You won't be penalized for reasonable anonymization.
The Revision Process
If you receive "Revision Required," you're not rejected. You have a clear path to pass.
The panel will specify what needs strengthening:
- Major revisions: Typically mentorship evidence or problem framing. Requires 4-6 weeks to address.
- Minor revisions: Typically clarity or additional examples. Requires 1-2 weeks.
You have 12 weeks to resubmit revised artifacts. The same panel reviews your revision.
Revision Success Rate
Revision is not a death sentence. Most who revise successfully pass.
FAQ: Common Questions
Q: Can I use internal work for my eval program design if it's been published externally (e.g., as a case study)?
A: Yes. As long as you have permission and anonymize proprietary information appropriately, you can use real work.
Q: Can I use the same published contribution for another credential (like a portfolio for an academic PhD or teaching role)?
A: Yes. A piece of work can serve multiple purposes. You don't need separate publications.
Q: What if I co-authored my published work? Does that disqualify me?
A: No. Co-authorship is fine. Specify your contribution (1st author, 2nd author, equal contribution, etc.). You should be able to speak to your specific role.
Q: Can I use a product I built that's not open-source as my published contribution?
A: It depends on impact. If it's closed-source, the bar is higher: documented case studies, user testimonials, clear evidence of adoption. Open-source projects get credit for transparency and community.
Q: If I mentor someone who fails their certification exam, does that hurt my portfolio?
A: No. Your job is to develop them; their job is to pass the exam. Mentee growth is measured by progression, not exam results.
Q: How do I find someone to mentor if I don't already have mentees?
A: You have options: (1) Mentor an existing junior evaluator at your org. (2) Volunteer as a mentor through eval.qa's mentorship program. (3) Find someone in your professional network interested in learning evaluation. You have time to find them; don't wait.
Q: Can I mentor someone remotely?
A: Yes. Document your sessions (Zoom, Slack, email). Remote mentorship is just as valid as in-person.
Q: Is 6 weeks enough to mentor someone if I'm just starting?
A: No. You should have 6+ months of mentoring history by the time you submit. Plan for this when you're starting your eval journey.
Q: Can my portfolio evaluation program be the same as my day job?
A: Yes. If you designed and led an evaluation program at work, that's valid. You don't need side projects.
Q: If my eval program design was rejected by my organization for implementation, can I still use it in my portfolio?
A: You can, but frame it as a "design proposal" not "executed program." The panel will ask why it wasn't implemented. Be honest. If it was rejected for budget reasons, that's understandable. If it was rejected because the design was flawed, use something else.
Q: How much detail should I include if I'm anonymizing?
A: Enough detail that an expert evaluator can understand the sophistication of your approach. If anonymization makes it too vague, you've anonymized too much.
Q: Can I update my portfolio after submission (before the panel reviews it)?
A: You have 5 days after submission to make minor updates (clarifications, broken links, etc.). After that, it's locked for review.
Q: What if I think a reviewer was unfair in their assessment?
A: You can appeal. You'll get a second review by a different panel member. Appeals are rare but possible if you can point to specific errors in assessment.
