The Philosophy: Giving Back to the Field
The L5 Commander certification requires more than demonstrated mastery of evaluation. It requires advancing the field. A Commander has an obligation to give back to the community that elevated them. This is not charity; it's enlightened self-interest. When you improve the practices of others, you improve the entire landscape of AI evaluation. Higher quality evaluation everywhere benefits everyone.
This is why every L5 candidate must document at least one significant industry contribution—work that has value beyond their own organization and demonstrates leadership in the evaluation community.
Eight Accepted Contribution Types with Quality Criteria
1. Published Methodology (Blog, Medium, Substack)
What it is: A written methodology article explaining an evaluation approach, framework, or set of findings. Published on a public platform (Towards Data Science, your personal blog, company blog, Substack, LinkedIn, etc.).
Minimum requirements:
- 2,000+ words
- At least one novel contribution (new method, new finding, new framework, new perspective)
- Real data and examples, not just theory
- Clear implementation guidance (readers could apply your method)
- Published publicly (not paywalled or private)
- Minimum 500 readers/views (evidence of reach)
- Authored or co-authored by you; your contributions clearly stated
Quality indicators: Comments/responses from community, citations in subsequent work, external validation. A post with 50 comments is stronger evidence of impact than one with 2.
Examples: "Why BLEU Is Broken for Modern NLP," "Evaluating Multimodal AI: A Framework for Vision-Language Models," "Contamination Detection: Finding Training Data Leakage in Your Evaluation."
2. Open-Source Framework or Library
What it is: A public GitHub repository (or equivalent) containing evaluation tools, frameworks, or libraries that others can use. Must be genuinely useful and well-maintained.
Minimum requirements:
- Functional, tested code (not a proof-of-concept)
- Clear README with usage examples
- Open-source license: permissive (MIT, Apache-2.0, BSD) or copyleft (GPL)
- Minimum 50 stars (evidence of adoption)
- Active maintenance (response to issues in <2 weeks)
- Test coverage >80% (not untested code)
- Documentation for key functions/classes
- You are primary author or lead maintainer
Quality indicators: Forks, external contributions, citations in other projects, community engagement. A library with 5 external PRs merged is stronger than one with 0.
Examples: eval-suite (comprehensive evaluation harness), calibration-tools (rater calibration framework), contamination-detector (checks for data leakage), metric-garden (collection of novel metrics).
3. Conference Presentation (Recorded)
What it is: A talk at a technical conference with video recording, covering novel evaluation methodology or findings. Can be a workshop, main track, or invited talk.
Minimum requirements:
- Conference with a selective review process (e.g., documented acceptance rate below 50%)
- 30+ minute presentation
- Video recording publicly available
- Novel content (new method, new findings, new perspective)
- Audience >30 people
- Real data and examples, not just theory
- You are primary speaker or co-speaker with clear contribution
Accepted venues: NeurIPS, ICML, ICLR, EMNLP, ACL, FAccT, AIES, major industry conferences (KDD, SIGMOD, etc.), specialized workshops at these venues.
Quality indicators: Views on video recording, citations of your talk by others, follow-up engagement. A talk with 800+ views is stronger than one with 30.
4. Standards Body Comment (NIST, EU AI Office, IEEE)
What it is: Formal comment submitted to standards body or regulatory agency on AI evaluation policy, requirements, or guidance. Publicly documented.
Minimum requirements:
- Submitted during formal comment period (RFC, notice of proposed rulemaking, etc.)
- 1,000+ words with substantive technical content
- Original analysis or novel perspective on evaluation
- Publicly available (searchable in comment database)
- Demonstrates understanding of regulatory landscape
- You are primary author; contributions clearly stated if co-authored
Qualifying bodies: NIST (AI Safety Institute, ARIA), the EU AI Office, IEEE standards working groups (e.g., IEEE 3119), ISO/IEC 42001, FTC guidance processes.
Quality indicators: Citation by regulators, influence on final guidance, response from other commenters, media coverage. Evidence that your comment shaped policy is very strong.
5. Peer-Reviewed Paper
What it is: Academic paper published in peer-reviewed journal or conference proceedings. Advances evaluation methodology or findings.
Minimum requirements:
- Accepted and published in conference proceedings or journal
- Venue is selective (acceptance rate <50%)
- Original research (novel method, novel dataset, novel findings)
- Evaluation is core contribution (not just an application)
- You are first author or co-author with clearly-stated contribution
- Paper is publicly available (preprint at minimum)
Strong venues: NeurIPS, ICML, ICLR, EMNLP, ACL, FAccT, AIES, JAMIA (clinical), IEEE TSE (software evaluation).
Quality indicators: Citation count, follow-up work citing your paper, replication of your methods, media coverage. A paper with 20+ citations is stronger than one with 0.
6. Workshop or Training Program
What it is: Designed and delivered a workshop, training course, or educational program teaching AI evaluation. Multiple participants, documented learning outcomes.
Minimum requirements:
- Minimum 8 hours of instruction (or equivalent self-paced online course)
- Minimum 10 participants
- Curriculum document (objectives, content outline, schedule)
- Delivery documentation (attendance, feedback, materials)
- Learning outcomes assessed (tests, projects, participation)
- You are primary instructor or course designer
- Novel or significantly improved content (not just repeating standard pedagogy)
Quality indicators: Participant feedback scores (target >4.0/5.0), certification or completion rates, follow-up adoption of concepts taught, repeat offerings.
7. Community Challenge Design
What it is: Designed and administered a public challenge, leaderboard, or competition focused on AI evaluation. eval.qa accepts these through its challenge committee.
Minimum requirements:
- Clear problem statement and evaluation criteria
- Minimum 20 participants
- Public leaderboard and results
- Evaluation methodology is novel or advances the field
- Documentation of findings and lessons learned
- You designed and managed the challenge
- Challenge runs for minimum 4 weeks
Quality indicators: Participant engagement, quality of submissions, knowledge contributed to field, adoption of challenge methodology by others.
8. Evaluation Dataset Release
What it is: Published a dataset designed for evaluating AI systems. Can be model outputs with human judgments, annotated examples, or benchmark data.
Minimum requirements:
- Minimum 500 examples (for smaller domains) or 2,000 examples (for larger domains)
- Comprehensive annotation guidelines (>1 page)
- Inter-annotator agreement documented (Cohen's kappa >0.60)
- Comprehensive dataset documentation (format, structure, fields)
- Public release with permissive license (CC-BY or equivalent)
- Hosted on permanent platform (Hugging Face Datasets, GitHub, Zenodo, etc.)
- You are primary data curator; contributions clearly stated if collaborative
- Usage tracking or citations (evidence of adoption)
Quality indicators: Downloads/usage, citations in papers, adoption by benchmark creators, follow-up research using the dataset.
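The inter-annotator agreement bar above is easy to check yourself. This is a minimal sketch that computes Cohen's kappa from scratch for two annotators labeling the same examples (the annotator data is invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same examples."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of examples where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators' judgments ("good"/"bad") over eight shared examples.
a = ["good", "good", "bad", "good", "bad", "good", "good", "bad"]
b = ["good", "good", "bad", "bad", "bad", "good", "good", "good"]
print(round(cohens_kappa(a, b), 3))  # 0.467 -- below the 0.60 bar, so the guidelines need work
```

In practice you would likely use `sklearn.metrics.cohen_kappa_score`, which computes the same statistic; the point is that raw percent agreement (75% here) can look fine while chance-corrected agreement fails the threshold.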
How to Choose the Right Contribution Type
Different contribution types suit different backgrounds and constraints:
| Type | Time (weeks) | Skills Needed | Impact Level | Best For |
|---|---|---|---|---|
| Blog Post | 2-4 | Writing, evaluation knowledge | Medium (1,200 avg readers) | Quick wins; thought leaders |
| Open-Source Framework | 6-12 | Engineering, design, maintenance | High (distributed adoption) | Engineers with reusable code |
| Conference Talk | 4-8 | Public speaking, novelty | Medium (500+ views) | Storytellers; domain leaders |
| Standards Comment | 3-5 | Policy knowledge, writing | Medium-High (regulatory) | Policy-focused professionals |
| Peer-Reviewed Paper | 8-16 | Research, novelty, publication | High (long-term citations) | Researchers; academics |
| Workshop/Training | 6-10 | Teaching, pedagogy, community | Medium (20-50 beneficiaries) | Educators; mentors |
| Challenge Design | 8-12 | Project management, novelty | High (competitive research) | Community builders |
| Dataset Release | 4-8 | Data curation, annotation mgmt | High (reusable resource) | Domain experts with data |
Quick recommendation: If you have 2-4 weeks: Blog post. If you have reusable code: Open-source framework. If you enjoy public speaking: Conference talk. If you have policy expertise: Standards comment. If you're a researcher: Paper. If you love teaching: Workshop. If you're a project person: Challenge. If you have good data: Dataset.
Writing a Methodology Publication That Meets the Bar
A strong methodology publication has this structure:
- Hook (500 words): Problem statement. Why should readers care? What evaluation failure are you addressing? Real-world consequence of bad evaluation in this domain.
- Background (500 words): Existing approaches and their limitations. What have others tried? Why doesn't it work? What gap are you filling?
- Your approach (1,500 words): Step-by-step explanation of your methodology. Why these choices? How is it different? Include pseudocode, flowcharts, or detailed examples.
- Real data & results (1,500 words): Application of your method to real data. Show before/after. Include comparison to baselines. Quantify improvements.
- Implementation guide (800 words): How can readers implement this? Code snippets, libraries, tools. Remove the friction for adoption.
- Lessons & tradeoffs (500 words): What surprised you? Where does it fail? When shouldn't people use this approach? This honesty builds credibility.
- Call to action (300 words): What's the next step for readers? How can they extend your work?
Total: 5,600 words, strong structure, publishable.
Where to publish: Towards Data Science (technical audience), your company blog (reaches your company's audience), a personal Substack (builds an audience over time), LinkedIn (professional reach). Weaker choices: an unaffiliated Medium post (declining organic reach outside curated publications), dev.to (lower quality bar), a personal blog alone (readers have no way to discover it).
Before publishing: Does it teach something readers don't know? Is there real data (not hypothetical)? Can readers implement this? Did you show where it fails (not just successes)? Is the writing clear (read aloud)? Have you gotten feedback from 1-2 peers?
Releasing an Open-Source Eval Framework
Step 1: Scope the problem. What evaluation task are you solving? Multimodal eval? Safety scoring? Calibration tracking? Be specific. A "general evaluation framework" is too broad; "RAG evaluation harness with retrieval + relevance + faithfulness metrics" is right-scoped.
Step 2: Design the API. How will users interact with this? What's the main class/function? Example usage:
```python
from eval_framework import Evaluator, RAGMetrics

evaluator = Evaluator(metrics=[RAGMetrics.retrieval_precision,
                               RAGMetrics.answer_relevance])
results = evaluator.evaluate(inputs, outputs, references)
```
Make the API intuitive. First 30 seconds of usage should feel natural.
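One way to back an API like the one above (a sketch; `Evaluator`, `token_overlap`, and every name here are hypothetical, not an existing package): make each metric a plain callable taking one (input, output, reference) triple and returning a score, and have the evaluator aggregate per-example scores per metric.

```python
import re
from statistics import mean

class Evaluator:
    """Runs a list of metric callables and aggregates per-example scores."""
    def __init__(self, metrics):
        self.metrics = metrics  # each: (inp, out, ref) -> float in [0, 1]

    def evaluate(self, inputs, outputs, references):
        results = {}
        for metric in self.metrics:
            scores = [metric(i, o, r)
                      for i, o, r in zip(inputs, outputs, references)]
            results[metric.__name__] = mean(scores)
        return results

def _tokens(text):
    """Lowercase alphanumeric tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_overlap(inp, out, ref):
    """Toy relevance proxy: fraction of reference tokens present in the output."""
    ref_tokens = _tokens(ref)
    return len(ref_tokens & _tokens(out)) / len(ref_tokens) if ref_tokens else 0.0

evaluator = Evaluator(metrics=[token_overlap])
results = evaluator.evaluate(
    inputs=["What color is the sky?"],
    outputs=["The sky is blue."],
    references=["the sky is blue"],
)
print(results)  # {'token_overlap': 1.0}
```

The design choice worth copying: because metrics are plain functions, users can pass their own callables alongside yours without subclassing anything, which is what makes the first 30 seconds feel natural.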
Step 3: Implement with tests. Write production-quality code. Tests for every function. >80% coverage. Error handling for common edge cases.
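Edge cases are where evaluation code usually breaks, so tests should pin down the behavior you chose, not just the happy path. A sketch of the kind of pytest file step 3 calls for, using a hypothetical `retrieval_precision` metric defined inline:

```python
# test_metrics.py -- run with `pytest test_metrics.py`

def retrieval_precision(retrieved, relevant):
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0  # explicit choice: empty retrieval scores zero, not an error
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def test_perfect_retrieval():
    assert retrieval_precision(["d1", "d2"], ["d1", "d2", "d3"]) == 1.0

def test_partial_retrieval():
    assert retrieval_precision(["d1", "d9"], ["d1"]) == 0.5

def test_empty_retrieval_is_zero_not_crash():
    assert retrieval_precision([], ["d1"]) == 0.0

def test_duplicates_do_not_inflate_precision():
    # set() dedupes matches, but the denominator stays the raw retrieved count
    assert retrieval_precision(["d1", "d1"], ["d1"]) == 0.5
```

Tests like the last two are what reviewers and adopters look for: they document a deliberate decision (empty input returns 0.0; duplicates don't inflate the score) rather than leaving the behavior to accident.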
Step 4: Write documentation. README with: (1) What problem it solves, (2) Installation instructions, (3) Quick start example, (4) Full API documentation, (5) Advanced usage examples, (6) Contributing guidelines. This matters as much as the code.
Step 5: Publish and engage. Release on GitHub with clear license. Share in relevant communities (Reddit r/MachineLearning, Hacker News, Twitter). Respond to issues promptly. Merge good external PRs. Monitor usage.
The #1 reason open-source projects fail: lack of maintenance. If you release and disappear, people don't adopt it. Commit to responding to issues in <2 weeks for the first 12 months. After that, you can hand off or slow down.
Submitting to NIST: Standards Body Comments
Step 1: Find the relevant RFC or notice. Check nist.gov, the EU AI Office site, and ieee.org for open comment periods. NIST runs several at a time: AI Safety Institute (AISI) guidance drafts, ARIA methodology comments.
Step 2: Understand the issue. Read the draft guidance carefully. What's the core question the agency is asking? What gap are you addressing?
Step 3: Write your comment (1,000+ words). Structure:
- Executive summary (100 words): What's your main point?
- Technical analysis (600 words): Specific issues with the draft guidance and your proposed solutions.
- Implementation considerations (200 words): How would regulators implement your suggestion?
- Evidence (100 words): References, citations, real-world examples.
Step 4: Submit formally. Follow the submission process exactly (online form, email, document format). Include your name, affiliation, credentials. Be professional.
Step 5: Document and publicize. Once submitted, publish a summary on your blog or LinkedIn. Reference the comment filing number. This increases impact and visibility.
Community Peer Review Through eval.qa
Before finalizing your contribution, submit it to the eval.qa community review process. This is optional but strongly recommended.
How it works:
- Submit your contribution (draft article, code repo, proposal) to eval.qa review portal
- Community volunteers (other Commanders, advanced L3+ candidates) review and provide feedback
- You get 2-3 detailed reviews within 2 weeks
- You revise based on feedback
- You get a "community-vetted" badge when ready
This is not required for submission, but it significantly strengthens your portfolio. "Community-vetted blog post" is stronger than "unvetted blog post."
Documentation Requirements: Proving Your Contribution Meets the Bar
When you submit your portfolio, you must document that your contribution meets minimum criteria:
For blog posts: Link to published article + proof of readership (screenshot of view count, analytics). For social proof: include comments, shares, or citations.
For open-source: Link to GitHub repo + proof of adoption (fork count screenshot, star count screenshot, issue activity). If possible, evidence of external usage (citations in other projects, production deployments).
For conference talks: Link to video recording + event page showing acceptance. Include attendance proof (attendee list or event capacity).
For standards comments: Link to submitted comment in official comment database (searchable by name). Include final agency response if available.
For papers: Link to published version or preprint + venue information (acceptance rate, selectivity). Include citation metrics if available (Google Scholar).
For workshops: Curriculum document + attendance list + participant feedback form + learning assessment results.
For challenges: Challenge website + leaderboard results + participant count + analysis of learnings.
For datasets: Link to public dataset + download count statistics + citations or usage in other projects.
Timeline: How Long Does a Contribution Take?
Industry contributions typically take 4-8 weeks of focused work (not consecutive, can be spread over months).
- Week 1: Choose contribution type, brainstorm specific topic, validate demand.
- Week 2-4: Create the contribution (write, code, collect data, etc.)
- Week 5: Self-review, improve quality, fix issues.
- Week 6: Peer review (optional) or community feedback collection.
- Week 7: Revisions based on feedback.
- Week 8: Final polish and publication.
Fastest path: blog post (2-4 weeks). Slowest path: peer-reviewed paper (12-24 weeks with submission and review cycles).
Real Contribution Examples with Impact Metrics
Example 1: Blog Post - "RAG Evaluation Beyond BLEU"
- Published on Towards Data Science, 2,400 words
- Real data from 5 RAG systems
- 2,100 reads in first month
- 45 responses/comments
- Shared 300+ times on LinkedIn
- Cited in 8 subsequent papers on RAG evaluation
- Contribution: Now widely adopted metrics for RAG evaluation
Example 2: Open-Source - "EvalKit" Library
- Python library for common evaluation metrics
- 620 GitHub stars, 92 forks
- Active maintenance (avg response time 8 days)
- 7 external PRs merged from community
- Used by 150+ companies (inferred from forks + downloads)
- Contribution: Standard toolkit adopted by industry
Example 3: Conference Talk - NeurIPS 2024 "Evaluating Generative Models at Scale"
- Main conference oral presentation
- 900+ views on NeurIPS website
- 2 follow-up papers citing methodology
- Invited talks at 3 companies based on presentation
- Contribution: Influenced industry evaluation practices
Frequently Asked Questions
Can I collaborate with someone else on a contribution?
Yes. But your contribution must be clear. "I was primary author and handled X, co-author handled Y" is fine. "I was equally involved in all aspects" is harder to verify. Contributions with clear ownership are stronger.
What if my contribution doesn't get accepted on first try?
For papers: expect rejection and revision. For open-source: deploy and iterate. For blog posts: publish anywhere that accepts; aim higher next time. Rejections don't disqualify you, but accepted contributions are stronger portfolio evidence.
Does employer approval matter?
For open-source: check with your employer's IP policy. Most allow you to open-source non-proprietary evaluation frameworks. For blog posts: you may need approval if using company data. For standards comments: typically fine as personal contribution. Check your employment agreement.
Can I reuse code from my company's eval system?
Only if your employer approves and it's properly licensed. Don't assume open-sourcing is permitted. Check IP policy first. If they don't allow it, choose a different contribution type (blog post, paper, standards comment, challenge design all work without code).
How do I know if my contribution is at "Commander level"?
Ask: (1) Does it advance the field or help practitioners? (2) Is the quality high (well-written, rigorous, complete)? (3) Did community respond positively? (4) Would I be proud to put this in a portfolio? If yes to all four, you're good.
