Why Exemplar Portfolios Are Published

Excellent work is clear when you see it. But "excellence" in evaluation is not obvious to everyone. Publishing anonymized exemplar portfolios makes the standard concrete: you can see exactly what passing work looks like.

These portfolios are real, fully scored, and representative of passing work at the L5 level.

  • 47 pages: average length of a passing portfolio
  • 68%: candidates who pass on first submission
  • 3.2 weeks: average review time

Portfolio 1: The Eval Architect

Background

Senior ML Engineer at a major fintech company (12 years experience, 8 in ML/AI). Role: built the entire evaluation infrastructure for an automated underwriting system (AI-driven lending decisions). Candidate brings infrastructure expertise, regulatory knowledge, and team leadership experience.

Portfolio Score: 82/100 PASS

Final Verdict: Exceptional. Clear pathway to highly influential evaluation work.

Artifact 1: Designed a 12-System Evaluation Program for Automated Underwriting AI

What it is: A 16-page document describing the design, implementation, and validation of a comprehensive evaluation framework for assessing lending decisions across 12 different AI systems (models for credit risk, fraud detection, income verification, collateral assessment, etc.).

What scored highest:

  • Methodology transparency: Clear articulation of why each metric was chosen. For example: "We use Brier score for calibration because it penalizes confidently wrong predictions, critical in lending where miscalibration leads to defaults."
  • Tradeoff documentation: The portfolio explicitly documents tension between metrics (fairness vs. accuracy) and shows how stakeholders made principled choices. Not pretending these tradeoffs don't exist.
  • Real data and results: Actual evaluation results on real borrower data (anonymized). For example: "Model A has 0.85 AUC but 23% higher denial rate for applicants with no credit history, flagging potential fairness issue."
  • Validation against real outcomes: The framework's predictions correlated with actual loan performance (default rates, etc.). r = 0.71 between predicted risk and actual defaults. This is powerful evidence of validity.
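The calibration rationale in the first bullet can be made concrete with a small sketch. The probabilities and outcomes below are invented for illustration; they are not from the candidate's data.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better; confidently wrong predictions are penalized heavily.
    """
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# A confidently wrong prediction (0.95 default-free for a loan that
# defaulted, outcome 0) costs far more than a hedged one (0.55):
confident_wrong = brier_score([0.95], [0])  # (0.95 - 0)^2 = 0.9025
hedged_wrong = brier_score([0.55], [0])     # (0.55 - 0)^2 = 0.3025
```

This is exactly the property the portfolio cites: in lending, a model that is wrong *and* confident is the dangerous case, and Brier score surfaces it where raw accuracy would not.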

Artifact 2: Published "Evaluating Financial AI Under SR 11-7" in Towards Data Science (2,400 reads)

What it is: A 4,000-word methodology article explaining how financial institutions should evaluate AI systems under regulatory constraints. Covers risk assessment frameworks, documentation requirements, stakeholder alignment.

What scored highest:

  • Community impact: 2,400 reads is strong for a technical article. 180+ highlights. Comments from practitioners at other fintech companies thanking them for clarity.
  • Practical clarity: The article isn't theoretical. It includes step-by-step guidance. "Here's how to implement threshold-based alerts for model drift." Not just "monitoring is important."
  • Regulatory knowledge: Deep understanding of SR 11-7 (Fed guidance on model risk management). Shows the candidate knows the legal/regulatory context, not just evaluation theory.

Artifact 3: Mentored 3 L3 Candidates Over 8 Months

What it is: Documentation of mentorship relationships with three CAEE candidates preparing for L3 labs. Includes 24 session logs (one per month per mentee), progress tracking, and final mentorship summary reports.

What scored highest:

  • Systematic approach: Mentorship wasn't ad hoc. Clear progression: months 1-2 (foundational concepts), months 3-5 (applied projects), months 6-8 (independent work with feedback). Each mentee improved visibly.
  • Mentee outcomes: All three candidates passed their L3 labs on the first attempt. Two were later hired as evaluation leads at other companies. This demonstrates the quality of the mentorship.
  • Documentation quality: Session notes are detailed but concise. "Covered Spearman correlation vs. Pearson; mentee misunderstood why Pearson assumes linearity. Walked through scatter plot examples. Mentee demonstrated understanding with new example."
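The Spearman-versus-Pearson distinction from that session note is easy to demonstrate. The toy data below is hypothetical, chosen only to show a monotonic but nonlinear relationship.

```python
def pearson(xs, ys):
    """Pearson correlation: strength of *linear* association."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation: Pearson computed on ranks (toy data, no ties)."""
    def rank(vs):
        return [sorted(vs).index(v) + 1 for v in vs]
    return pearson(rank(xs), rank(ys))

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]  # monotonic but clearly nonlinear

# Spearman is 1.0 (the ordering is perfectly preserved), while Pearson is
# noticeably below 1 because the relationship is not linear.
```

Scatter-plotting `xs` against `ys` (as the session note describes) makes the same point visually: all points rise together, but not along a line.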

Scoring Breakdown

  • Methodological Rigor (25/25): Evaluation design is principled and well-documented. Clear rationale for every choice. Validation against real outcomes is present.
  • Stakeholder Impact (20/20): The work directly influenced lending decisions. Artifact 2 was read by practitioners at 40+ institutions. Mentees succeeded.
  • Technical Depth (18/20): Strong knowledge of ML evaluation, regulatory context, and financial AI. Minor gap: limited coverage of adversarial robustness testing, but not critical for fintech context.
  • Communication (14/15): Portfolio is well-organized and clear. Artifacts are polished. One session note in mentorship artifact was less detailed than others, minor issue.
  • Originality (5/5): Not copying frameworks from others. The 12-system evaluation program was custom-designed for the specific business context.

Portfolio 2: The Domain Expert

Background

Clinical Informatics Specialist (15 years in healthcare, 7 in AI evaluation). Background in both clinical medicine and data science. Built evaluation frameworks for diagnostic AI systems in partnership with hospital networks and FDA.

Portfolio Score: 83/100 PASS

Final Verdict: Strong. Exceptional domain expertise; clear regulatory alignment.

Artifact 1: FDA-Aligned Evaluation Framework for Diagnostic AI

What it is: An 18-page technical document (plus 30 pages of appendices with rubrics, datasets, and validation results) describing how to design and validate evaluation systems for medical image analysis AI in the FDA regulatory context.

What scored highest:

  • Regulatory knowledge: Deep understanding of FDA's expectations for clinical AI. References specific FDA guidance documents (e.g., "AI/ML-Based Software as a Medical Device Action Plan"). Shows the candidate has done the homework.
  • Domain-specific rigor: The framework addresses medical-specific concerns: bias across patient demographics, failure modes in rare diseases, confidence calibration (when the AI should decline to predict). Not generic.
  • Real-world datasets: Evaluation conducted on actual diagnostic data from three hospital networks (anonymized). This is stronger than using public benchmarks.

Artifact 2: Co-Authored JAMIA Paper on Clinical AI Evaluation

What it is: A peer-reviewed paper in the Journal of the American Medical Informatics Association (strong venue, highly cited). Title: "Evaluating Clinical Diagnostic AI: Frameworks for Regulatory Compliance and Clinical Utility Assessment."

What scored highest:

  • Peer review validation: Published in a rigorous, domain-specific venue. This is stronger than a blog post. The work has been vetted by clinical informaticists and AI researchers.
  • Clinical credibility: Co-authored with leading clinicians and informaticists from major medical centers. Shows broad recognition in the field.
  • Citation impact: Published 18 months ago; already 45 citations. This work is influencing clinical AI evaluation practices.

Artifact 3: Led a 6-Person Calibration Session Group

What it is: Documentation of a 3-month calibration workshop for clinical raters evaluating diagnostic AI. 6 participants (radiologists, pathologists, informaticists). 12 sessions total, 2 hours each.

What scored highest:

  • Methodological rigor: Clear protocol: week 1 (rubric development), weeks 2-3 (calibration on gold-standard cases), weeks 4-12 (independent rating + group discussion). Inter-rater agreement (Cohen's kappa) improved from 0.61 initially to 0.84 by final week.
  • Participant feedback: Post-workshop survey: 5/6 participants reported increased confidence in their evaluation judgments. One published their own paper on diagnostic AI evaluation.
  • Documentation: Each session has a summary: what was covered, agreement metrics, insights. This is the model for how to run effective rater calibration.
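For readers unfamiliar with the agreement statistic tracked above (0.61 improving to 0.84), here is a minimal Cohen's kappa sketch. The rater labels below are hypothetical, not from the workshop.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in counts_a.keys() | counts_b.keys()
    )
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "neg"]
kappa = cohens_kappa(a, b)  # 5/6 observed, 0.5 expected -> kappa of 2/3
```

Because kappa subtracts out chance agreement, a rise from 0.61 to 0.84 reflects genuinely converging judgments, not just raters defaulting to the same common label.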

Scoring Breakdown

  • Methodological Rigor (24/25): Framework is rigorous and well-validated. Calibration work is exemplary. Minor: could have included more adversarial testing (e.g., edge cases in rare diseases).
  • Domain Expertise (25/25): Exceptional. Deep clinical knowledge. Understands regulatory landscape. Knows what clinicians care about.
  • Impact (18/20): Paper has good citations. Framework is used by hospitals. Some room for broader industry impact (e.g., consulting with more institutions).
  • Communication (11/15): Portfolio is well-written overall, but some sections are heavy on clinical jargon and could be more accessible to evaluators outside healthcare. Artifact 1's appendices are repetitive.
  • Originality (5/5): The FDA-aligned framework is novel and tailored to clinical context.

Portfolio 3: The Consultant

Background

Independent AI Consultant (10 years experience, 5 as independent). Background in retail AI, supply chain optimization, and evaluation methodology. Has evaluated AI systems for 30+ client companies.

Portfolio Score: 73/100 PASS

Final Verdict: Strong. Exceptional breadth; demonstrated consulting excellence.

Artifact 1: Evaluated 8 AI Systems for a Fortune 500 Retailer (Includes DCR Excerpts)

What it is: Condensed case study (10 pages) from a larger consulting engagement. Describes evaluation of 8 different AI systems across retail operations: demand forecasting, inventory optimization, price optimization, store layout recommendation, customer retention, supply chain, product recommendation, checkout fraud detection.

What scored highest:

  • Breadth of application: The candidate evaluated AI systems across vastly different domains within one company. Shows versatility and ability to apply evaluation principles across contexts.
  • Business impact: One system (inventory optimization) resulted in 2.1% inventory reduction, saving $8.3M annually. The evaluation identified which models to deploy and which to hold. This is proof that evaluation drives real value.
  • Tradeoff clarity: For price optimization AI, clearly documented the tension between revenue optimization and margin preservation. Showed how different pricing strategies evaluated differently. Stakeholders made informed choice.
  • Honest assessment: One system was evaluated as "not ready for deployment." The candidate recommended additional work before launch. This honesty is highly valued; consultants sometimes feel pressure to say "yes" to keep clients happy.

Artifact 2: Open-Sourced a Retail AI Evaluation Framework on GitHub (340 Stars)

What it is: Public GitHub repository containing Python library for retail AI evaluation. Includes implementations of ~20 retail-specific metrics, evaluation templates, documentation, and example datasets. 340 GitHub stars, 85 forks, 3 PRs merged from community.

What scored highest:

  • Community impact: 340 stars is solid for a niche library. Used by 85+ teams (inferred from forks). This is distributed impact—evaluation work that helps many organizations.
  • Documentation quality: README is comprehensive. Examples are clear. API is intuitive. This is professional open-source work, not just code dump.
  • Maintenance: Repository actively maintained (last update 2 weeks ago). Responds to issues. Takes PRs seriously. This shows commitment.

Artifact 3: Documented 12 Mentorship Sessions with Session Agendas and Outcomes

What it is: Detailed documentation of 12 mentorship sessions with a junior AI engineer preparing for L3 evaluation lab. Each session has: agenda, what was covered, mentee questions, key takeaways, assignments for next session, attendance/progress notes.

What scored highest:

  • Documentation depth: Each session summary is 1-2 pages. This level of detail shows commitment to mentorship quality. Not just "we talked about evaluation" but specific concepts covered, misunderstandings identified, progress tracked.
  • Mentee success: The mentee passed their L3 lab on first attempt with a score of 89/100. Six months later, hired as evaluation engineer at a major AI company. Clear trajectory of improvement.
  • Structured progression: Sessions follow a clear arc: foundational concepts → applied projects → independent practice. Assignments escalate in difficulty. This is pedagogically sound.

Scoring Breakdown

  • Breadth & Impact (24/25): Exceptional breadth across 8 different AI systems. Demonstrated business impact ($8.3M saved via better evaluation). Open-source framework used by community. Minor: would benefit from one peer-reviewed publication.
  • Technical Depth (16/20): Solid technical skills, but some areas show moderate rather than exceptional depth. Not known for breaking new theoretical ground; more of an "apply best practices well" consultant.
  • Communication (14/15): Case study is well-written. GitHub docs are clear. Mentorship notes are detailed. Overall communication is strong. Minor: case study anonymization could be a bit more detailed (harder to verify some claims).
  • Methodological Rigor (14/20): Evaluation approaches are sound but somewhat standard. Not pushing boundaries on novel evaluation methods. Not a weakness, but room for deeper innovation.
  • Originality (5/5): Open-source framework is original. Evaluation approach for 8 systems is novel for retail context.

Common Themes in High-Scoring Portfolios

1. Original Data, Not Borrowed Claims

All three exemplar portfolios use real data, real projects, and real outcomes. They don't just cite evaluation best practices; they demonstrate them. This is crucial. A portfolio that says "best practice is to validate against real outcomes" is weaker than one that shows: "We validated our evaluation framework against 500 loans and computed Pearson r = 0.71 with actual default rates."

2. Tradeoff Documentation

Excellence in evaluation means acknowledging that tradeoffs exist. You cannot optimize for everything simultaneously. The exemplar portfolios are explicit: "We prioritized fairness over raw accuracy" or "We optimized for regulatory compliance over user experience." They don't pretend these tensions disappear.

3. Stakeholder Impact Articulated

All three portfolios clearly articulate who benefited from the evaluation work and how. "This evaluation framework was used by 3 hospitals and 8 consulting clients." "The work influenced lending decisions on $500M+ in applications." "The open-source library is used by 85+ teams." Impact is measured, not assumed.

4. Methodology Transparency

You can understand exactly how the evaluation was done and why. Rubrics are shown. Metrics are justified. Choice rationale is clear. A reader could reproduce the work (with the same data). This transparency is what separates "evaluation that works" from "evaluation that might work."

Portfolios That Failed and Why

Failed Portfolio 1: The Vaporware

Issue: Artifact 1 was a theoretical framework paper ("A Principled Approach to Evaluating Multimodal LLMs") with no implementation or real data. Artifact 2 was a blog post offering high-level best practices but no novel insights. Artifact 3 documented mentorship sessions that largely consisted of pointing people to existing resources rather than developing new evaluation thinking. Final score: 41/100. Feedback: "Your work is competent but not at Commander level. You're synthesizing existing knowledge, not creating new knowledge or impact."

Failed Portfolio 2: The Black Box

Issue: Candidate worked at a major AI company on proprietary evaluation systems. But the portfolio couldn't disclose details due to NDA. So Artifact 1 was "I built an evaluation system (can't show it)." Artifact 2 was vague. Artifact 3 was a general description of mentorship without specific examples or metrics. Final score: 38/100. Feedback: "We cannot assess your work if you cannot describe it. Work with your employer on disclosure or submit different work. Evaluation is about transparency and rigor; your portfolio doesn't demonstrate either."

Failed Portfolio 3: The Grade Inflation

Issue: Candidate submitted evaluation work that all scored their own models as excellent. Artifact 1 described an evaluation system where all models got 8+/10 scores. Artifact 2 was a paper showing their own evaluation methodology always validated models. Artifact 3 was mentoring that consisted of giving positive feedback without critical assessment. Final score: 35/100. Feedback: "Your evaluation lacks rigor. Exceptional models exist, but when everything is excellent, you're not discriminating. Evaluation must be willing to say 'this is not good enough.'"

Failed Portfolio 4: The Narrow Specialist

Issue: Candidate was exceptionally deep in one narrow area (BERTScore variants) but showed no breadth. All three artifacts were about the same topic. No evidence of adaptability or systems thinking. Final score: 44/100. Feedback: "Commanders evaluate AI systems holistically. While your technical depth is impressive, L5 requires breadth. Show me evaluation across different domains, tradeoff thinking, stakeholder management. This portfolio shows strength but not Commander-level versatility."

Failed Portfolio 5: The Résumé

Issue: The portfolio was essentially a résumé with three bullet points per artifact. No detailed explanation, no tradeoff discussion, no real data, no documented impact. The word count was 3,000 against a target of 10,000+. Final score: 32/100. Feedback: "A portfolio is not a résumé summary. We need depth. Show us your thinking. Explain your tradeoff choices. Provide evidence of impact. This portfolio lacks the substance we need to assess your mastery."

Self-Assessment Checklist: Rate Your Own Portfolio

Before submitting, score yourself honestly on these 12 dimensions:

  1. Original work (yes/no): Is this evaluation work you personally conducted, not work synthesized from others? Can you defend every claim?
  2. Real data (yes/no): Does the portfolio use real data, real projects, real outcomes? Or is it theoretical/synthetic?
  3. Tradeoff articulation (yes/no): Do you explicitly acknowledge tensions and show how you resolved them? Or do you pretend tradeoffs don't exist?
  4. Methodological rigor (yes/no): Could someone reproduce your evaluation from your documentation? Are methodological choices justified?
  5. Impact measured (yes/no): Can you quantify the impact of your work? Readers, citations, stakeholder feedback, business outcome?
  6. Stakeholder clarity (yes/no): Is it clear who benefited from your work and how? Or vague about impact?
  7. Breadth (yes/no): Does your portfolio show versatility across multiple evaluation contexts? Or depth in one narrow area?
  8. Depth (yes/no): Do you demonstrate deep mastery in your domain, with nuanced understanding? Or surface-level competence?
  9. Communication (yes/no): Is the portfolio well-organized, clear, and professional? Would a non-expert understand your work?
  10. Honesty (yes/no): Do you acknowledge limitations and failures? Or present only positive outcomes?
  11. Word count (yes/no): Is your portfolio 10,000+ words (40+ pages)? Under 8,000 words, it's too brief for Commander-level assessment.
  12. Artifact quality (yes/no): Are your three artifacts of genuinely different types (e.g., methodology + publication + mentorship), not variations on a theme?

Scoring: Count your "yes" answers. 10+ yes answers: you're in good shape. 7-9: strengthen weaker areas before submitting. <7: significant work needed; consider deferring submission.
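The counting rule above can be expressed as a tiny helper if you want to script your self-check. The function name and answer format below are illustrative only, not part of the submission process.

```python
def readiness(answers):
    """Map 12 yes/no self-assessment answers to this page's guidance bands.

    `answers` is a mapping from dimension name to True (yes) / False (no).
    """
    yes_count = sum(1 for v in answers.values() if v)
    if yes_count >= 10:
        return "in good shape"
    if yes_count >= 7:
        return "strengthen weaker areas before submitting"
    return "significant work needed; consider deferring submission"

# Example: 8 "yes" answers falls in the 7-9 band.
sample = {f"dimension_{i}": i < 8 for i in range(12)}
verdict = readiness(sample)  # "strengthen weaker areas before submitting"
```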

How Reviewers Score Portfolios

Panel Composition

Each portfolio is reviewed by a 3-person panel: (1) an eval.qa faculty member (evaluation methodology expert), (2) an industry Commander (peer who has passed L5), (3) a domain expert aligned with the portfolio's domain (if possible).

All three panelists score independently before discussion. Average of the three scores is the final portfolio score.

Scoring Rubric

Each panelist scores the portfolio on five dimensions:

  • Methodological Rigor: Is evaluation methodology principled, well-justified, and reproducible?
  • Impact & Stakeholder Value: Does this work matter? Who benefited? Can you quantify the impact?
  • Technical Depth: Does the candidate demonstrate deep mastery of evaluation concepts relevant to their work?
  • Communication: Is the portfolio well-organized, clear, and professional?
  • Originality: Does this work create new knowledge or push evaluation practice forward?

Deliberation Process

After independent scoring, the three panelists discuss. Typical discussion topics:

  • Are there large score discrepancies? If one panelist scored 85 and another 60, discuss why and reconcile.
  • Does the portfolio represent Commander-level mastery? What would need to change to move it from 75 to 90?
  • Are there concerns not captured by the rubric? (E.g., does this person seem defensive when challenged?)
  • What feedback would help this candidate if they revise and resubmit?

Consensus is then reached. If the panel cannot reach consensus (e.g., two vote to pass, one votes to deny), the case is escalated to the eval.qa Executive Committee for final determination (rare).

Formatting Guidelines and Submission Requirements

Length Expectations

  • Total portfolio: 10,000-15,000 words (40-50 pages) is the target. Minimum 8,000 words; maximum 25,000 words (brevity plus depth is valued over verbosity).
  • Artifact 1: 4,000-6,000 words expected (detailed project or methodology documentation).
  • Artifact 2: 2,000-4,000 words (publication, paper, or contribution). If peer-reviewed, include copy of published version.
  • Artifact 3: 2,000-4,000 words (mentorship, leadership, or community contribution). If mentorship, include session logs.
  • Portfolio statement: 500-1,000 words tying the three artifacts together, articulating how they demonstrate Commander-level mastery.

File Formats

  • Preferred: PDF (formatted professionally, readable, preserves formatting)
  • Acceptable: Google Docs (shareable), Word (.docx)
  • Not acceptable: Markdown, plain text, screenshots, videos (unless video is embedded in PDF)

What to Include

  • Your name, email, date of submission
  • Title and brief abstract (1 sentence) for each artifact
  • Table of contents (if >40 pages)
  • Clear section headers and numbering
  • References and citations (follow any style; consistency matters more than specific style)
  • Appendices if needed (rubrics, datasets, detailed metrics)
  • Links to public artifacts (GitHub, published papers, blog posts) where applicable

What to Exclude

  • Proprietary client data (anonymize and generalize as needed)
  • Anything under NDA (work with your employer on disclosure or choose different artifacts)
  • Identifying information about human subjects
  • Unfinished or heavily commented code (polish before inclusion)
  • Lengthy appendices that repeat content (e.g., 20 pages of near-identical rubric variants are not useful)

The Revision Process and Timeline

First Submission → Initial Decision (3 weeks)

  1. Submit portfolio via eval.qa portal
  2. Panel appointed within 2 business days
  3. Panel reviews (typically 7-10 days per panelist)
  4. Panel deliberation (1-2 days)
  5. You receive notification: PASS, PASS with revision, or DENY

Possible Outcomes

PASS (score ≥ 70): You are certified as a Commander. Badge awarded immediately. Congrats.

PASS with revision (score 60-69): You pass conditionally, and the panel recommends specific revisions to strengthen the portfolio. You have 90 days to revise and resubmit to the same panel, which assesses the changes. If the revisions are substantial, another full review cycle (3 weeks) applies. Most candidates who revise move to higher scores (typically 75-82).

DENY (score < 60): The portfolio does not represent Commander-level work yet. Detailed feedback is provided, and you can reapply after 12 months. Most candidates reapply with different artifacts.

Revision Success Rates

Of candidates who submit "PASS with revision," 87% successfully revise and move to full PASS on second submission. Average score improvement: +8 points.

Frequently Asked Questions

Can my portfolio include work from 3+ years ago?

Yes. The work must be genuinely excellent, and you must be able to speak to it in detail. Most strong portfolios include work from the past 2-3 years, showing recent currency. But a groundbreaking evaluation framework from 5 years ago is acceptable if its impact is documented.

How detailed must the mentorship documentation be?

Session logs should average 1-2 pages each. Enough detail to understand what was discussed and the mentee's growth trajectory. "We talked about evaluation" is too vague. "We reviewed their evaluation framework for an e-commerce AI system, identified three methodological issues (contamination risk, missing fairness metrics, insufficient human validation), worked through solutions, and the mentee implemented these improvements" is appropriate.

Can I submit a portfolio in a language other than English?

No. The evaluation community operates in English, and portfolios must be submitted in English. (Non-English publications are acceptable as artifacts if you provide an English summary.)

What if my artifact 2 (publication) got rejected?

Rejected submissions are not artifacts. Publish first, then include the work in your portfolio. If you have a strong paper in the pipeline, you can note its status in the portfolio statement (e.g., "Paper accepted pending revisions at FAccT 2025"), but the final portfolio must include the published version or it doesn't count.

Do I need to have worked at a well-known company for my work to be credible?

No. Credibility comes from rigor and impact, not employer brand. A consultant who has documented strong evaluation work at 30+ companies can have a stronger portfolio than someone at a famous company with unmeasured impact.

Can I collaborate with someone else on artifacts?

Yes. But clearly denote your contributions. "I designed the evaluation framework and led the implementation; co-author X contributed to statistical analysis." The portfolio should make your personal contribution clear. Panels need to assess your individual mastery, not your collaborator's.

Key Takeaways

  • Exemplar portfolios demonstrate that L5 excellence involves rigorous methodology, real data, documented impact, and breadth across evaluation contexts.
  • High-scoring portfolios use original data, articulate tradeoffs, measure stakeholder impact, and demonstrate transparency.
  • Common failures: vaporware (theory without implementation), black boxes (can't disclose work), grade inflation (everything is excellent), narrow specialization, and insufficient depth.
  • The 12-point self-assessment checklist helps you evaluate your own portfolio before submission.
  • The review process is rigorous: 3-person panel, 5 scoring dimensions, deliberation for consensus. Average review time 3-3.5 weeks.
  • Portfolios should be 10,000-15,000 words, PDF format, with three distinct artifacts and a synthesizing statement.
  • 68% pass on first submission. Those with "PASS with revision" have 90 days to improve; 87% successfully revise to full pass.

Start Building Your Commander Portfolio

Excellence is visible when you know what to look for. Study these exemplars. Self-assess your work. Revise. Submit. The eval.qa community needs more Commanders setting the standard for rigorous, impactful AI evaluation.

View L5 Commander Requirements