Real-World Evaluation Ethics Violations

Evaluation ethics violations are more common than many evaluation professionals realize. Let's examine concrete cases where evaluation ethics went wrong, what happened, and what we can learn.

Case 1: The Biased Benchmark Selection

A large AI company developing a language model selected benchmark datasets and evaluation metrics that they knew favored their architectural approach. They had tested dozens of benchmark combinations and reported only those where their system ranked first. Independent researchers who attempted to reproduce results noticed the benchmark selection seemed suspiciously narrow and specifically chosen to highlight the company's strengths while avoiding benchmarks where competitors were stronger. The violation: deliberately selecting evaluation tasks to produce favorable results rather than comprehensively evaluating the system's actual capabilities. This is evaluation cherry-picking. The consequence: customers chose the product believing it was best-in-class when it excelled only on narrow benchmark dimensions.
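The selection effect in Case 1 is easy to quantify: if a lab tries many benchmark combinations internally and reports only the most favorable one, it will almost always find a "win" to publicize, even with no real edge. A minimal sketch with hypothetical numbers (the combo count and win probability are assumptions, not figures from the case):

```python
import random

random.seed(0)

N_COMBOS = 30   # benchmark combinations tried internally (hypothetical)
P_WIN = 0.5     # true chance of ranking first on any single combo

# Analytically: probability that at least one tried combination shows
# the system ranking first, purely by chance.
p_at_least_one = 1 - (1 - P_WIN) ** N_COMBOS
print(f"Chance of finding a 'winning' combo to report: {p_at_least_one:.9f}")

# Simulation: a lab that reports only its best run essentially always wins.
trials = 10_000
found_win = sum(
    any(random.random() < P_WIN for _ in range(N_COMBOS))
    for _ in range(trials)
)
print(f"Simulated rate of having a win to report: {found_win / trials:.4f}")
```

With a coin-flip-level system and 30 tries, a reportable "first place" exists over 99.9999% of the time, which is why the narrowness of a reported benchmark suite is itself a red flag.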

Case 2: Undisclosed Conflicts of Interest

A consulting firm was hired to evaluate whether a company's AI hiring system was biased. The consultant conducting the study had significant financial incentives (hidden equity options, future contracts) contingent on finding no meaningful bias. The firm failed to disclose these incentives, making the evaluation appear independent. When the firm concluded the system was not biased (contrary to what independent audits later found), the undisclosed conflict of interest undermined the evaluation's credibility. The violation: conducting an evaluation without independent judgment, with financial incentives creating pressure for favorable findings.

Case 3: Cherry-Picked Error Analysis

An evaluation team tested a system's safety by generating 100 adversarial prompts designed to elicit unsafe outputs. The system failed 40 of them. Rather than reporting this as "40% of adversarial prompts elicited unsafe outputs," the team reported "successful on 92% of safety tests" by including all the benign tests in the denominator. They also omitted the 40 failures from any detailed analysis, presenting only the headline 92% figure in executive summaries. While not technically false, this selective reporting pattern misleads about the system's actual safety. The violation: selecting which evaluation results to emphasize, creating misleading impressions.
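The arithmetic behind Case 3's framing is worth making explicit: both numbers come from the same results, and only the denominator changes. The benign-test count below is an assumption backed out from the reported 92%, not a figure stated in the case:

```python
adversarial_total = 100
adversarial_failed = 40
benign_total = 400          # assumed count; all benign tests passed
benign_passed = benign_total

# Honest framing: failure rate on the tests designed to find problems.
adversarial_failure_rate = adversarial_failed / adversarial_total
print(f"Adversarial failure rate: {adversarial_failure_rate:.0%}")

# Misleading framing: overall pass rate with benign tests in the denominator.
overall_pass_rate = (
    (adversarial_total - adversarial_failed) + benign_passed
) / (adversarial_total + benign_total)
print(f"Overall 'success' rate:   {overall_pass_rate:.0%}")
```

Diluting 40 adversarial failures with 400 easy passes turns a 40% failure rate into a 92% "success" rate without changing a single test result, which is why the denominator of any reported rate deserves scrutiny.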

Case 4: Evaluators Under Coercive Pressure

An evaluation team was instructed that their job security depended on completing 500 annotations per week, with no flexibility for quality. Under this time pressure, raters skipped items they found unclear, made quick judgments without careful deliberation, and were too fatigued to catch errors. Agreement scores dropped significantly, but management ignored the signal that evaluation quality was suffering. The violation: creating conditions that compromise evaluation quality through coercive performance pressure rather than supporting valid evaluation practices.
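Agreement statistics are exactly the kind of signal management ignored in Case 4: chance-corrected agreement falls when fatigued raters start making near-random judgments. A self-contained sketch of Cohen's kappa with illustrative labels (the toy ratings are invented to show the contrast, not data from the case):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently
    # at their own marginal rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n)
        for c in freq_a.keys() & freq_b.keys()
    )
    return (observed - expected) / (1 - expected)

# Careful raters: mostly consistent judgments on ten items.
careful = cohens_kappa(["safe"] * 8 + ["unsafe"] * 2,
                       ["safe"] * 7 + ["unsafe"] * 3)
# Rushed raters: the same marginals, but near-random per-item judgments.
rushed = cohens_kappa(["safe"] * 5 + ["unsafe"] * 5,
                      ["safe", "unsafe"] * 5)
print(f"careful kappa = {careful:.2f}, rushed kappa = {rushed:.2f}")
```

The rushed pair still agrees 60% of the time in raw terms, but kappa drops to 0.20 because most of that agreement is what chance alone would produce. A sustained kappa drop like this is a quality alarm, not noise.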

KEY INSIGHT

Evaluation ethics violations often aren't dramatic—they're subtle patterns of selective reporting, undisclosed conflicts, or organizational pressures that gradually degrade evaluation integrity.

The Evaluator's Duty to Report: Whistleblowing When You Find Fraud

As an evaluation professional, you have an ethical duty to report when you discover evaluation fraud or misconduct. This duty exists even when reporting carries personal risk. What does this mean practically?

Recognizing Evaluation Fraud

Evaluation fraud includes:

  • Deliberately selecting benchmarks, metrics, or test sets to guarantee favorable results
  • Concealing conflicts of interest that shaped findings
  • Suppressing or burying unfavorable results while publicizing favorable ones
  • Fabricating or altering evaluation data, ratings, or agreement statistics
  • Misrepresenting the methodology that was actually used

Reporting Pathways

Internal reporting: First, attempt to report concerns internally, typically through your manager, compliance officer, or ethics board. Document your concerns clearly. Internal reporting is less risky and may lead to correction. If internal reporting doesn't resolve the issue, escalate.

External reporting: If internal reporting fails, consider reporting to regulators (if applicable), professional organizations, or responsible disclosure processes. Many companies have responsible disclosure policies for security and safety issues that may apply to evaluation fraud.

Legal protections: Whistleblower protection laws exist in many jurisdictions. If you're reporting fraud to regulators or law enforcement, you may have legal protections against retaliation. Consult with legal counsel before whistleblowing to understand your protections.

The Psychological Cost of Speaking Up

Recognizing evaluation fraud and choosing to report it is psychologically difficult. You may fear career consequences, worry about being labeled a troublemaker, or question whether you're overreacting. These are reasonable concerns. But remember: reporting evaluation fraud serves a purpose larger than any individual career. If an evaluation is fraudulent or misleading and you report it, you're preventing decisions based on bad data. That's the core of professional ethics.

Professional Liability in Evaluation Practice

When evaluation professionals conduct work that leads to decisions with real consequences, they bear professional liability. Understanding this liability helps you recognize when your evaluation practices matter.

Liability for Evaluation Error

If you conduct an evaluation and your evaluation is wrong (failing to detect a problem, misidentifying a system as safe when it's unsafe, etc.), you could be liable if harm results. This liability applies particularly when you've held yourself out as an expert. If someone relies on your evaluation to make a decision and that decision causes harm, you might be sued.

To mitigate liability: document your evaluation process thoroughly, state the evaluation's scope clearly (what you measured and what you didn't), acknowledge uncertainty, and be transparent about your methodological assumptions. Evaluations with clear documentation of methods and limitations are far more defensible than vague ones.
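One way to make this documentation habit routine is to capture every evaluation in a structured record rather than prose alone, so scope limits and assumptions can't silently disappear. A minimal sketch; the field names and example values are illustrative, not a standard:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class EvaluationRecord:
    """Structured record of what an evaluation did and did not cover."""
    system_under_test: str
    methods: list[str]                  # how measurements were made
    dimensions_measured: list[str]      # what was evaluated
    dimensions_not_measured: list[str]  # explicit scope limits
    assumptions: list[str]
    known_limitations: list[str]
    conflicts_of_interest: list[str] = field(default_factory=list)

# Hypothetical example record
record = EvaluationRecord(
    system_under_test="resume-screening model v2",
    methods=["held-out test set", "adversarial prompt suite"],
    dimensions_measured=["accuracy", "selection-rate parity by gender"],
    dimensions_not_measured=["intersectional groups", "non-English resumes"],
    assumptions=["test set reflects the production applicant pool"],
    known_limitations=["no longitudinal data on hiring outcomes"],
)
print(json.dumps(asdict(record), indent=2))
```

Forcing `dimensions_not_measured` and `conflicts_of_interest` to be explicit fields means an empty list is a visible claim ("none") rather than a silent omission.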

Liability for Conflicts of Interest

If you conduct an evaluation with undisclosed conflicts of interest, and that evaluation is wrong or misleading, your liability is heightened. Courts view undisclosed conflicts as particularly egregious because they suggest you deliberately misled rather than made an honest error.

Protect yourself: disclose conflicts of interest fully. If you can't mitigate a conflict (for example, you're being paid by the company whose product you're evaluating), acknowledge it clearly. Let stakeholders decide whether they trust your evaluation given the conflict.

Liability for Missing Known Risks

If you know certain evaluation dimensions are important (e.g., bias testing for a hiring AI) but omit them from your evaluation, you bear liability if the system later fails on that dimension. If the omission was deliberate or negligent, liability is even greater. Evaluation professionals have a duty to evaluate comprehensively, not selectively.

Ethics of Evaluating AI for Harmful Applications

Some AI systems are developed for applications that could cause harm: surveillance systems, weapons, manipulation and persuasion systems, etc. Should evaluation professionals evaluate these systems? This is a genuine ethical dilemma with competing values.

Arguments for Evaluating Harmful-Potential Systems

First, if the system will be deployed regardless of whether you evaluate it, then your evaluation might make it safer. A thorough safety evaluation could reveal and help mitigate risks. By refusing to evaluate, you might ensure the system deploys without rigorous safety testing.

Second, some harm-adjacent applications have legitimate uses. An AI system for analyzing audio could be used for surveillance (harm) or for medical diagnosis (benefit). Refusing to evaluate audio analysis systems entirely means refusing to evaluate legitimate beneficial applications.

Third, evaluation can be a form of accountability. By evaluating a system meant for controversial use, you generate evidence about its actual capabilities and limitations. This evidence can later inform policy and oversight decisions.

Arguments Against Evaluating Harmful-Potential Systems

First, by evaluating a system, you lend credibility to the claim that it's safe enough to use. Even negative evaluation results ("this system has risks") might be interpreted as "we've vetted this system" when actually you've just documented failures. Your involvement might provide false assurance.

Second, your technical evaluation cannot address the political questions about whether surveillance, weapons, or manipulation should occur at all. Evaluating whether a surveillance system works well technically doesn't address whether surveillance should happen. By focusing on technical evaluation, you might be implicitly endorsing the application by converting it into a technical problem rather than a political one.

Third, by evaluating, you're participating in the system's development pipeline. You're making the system better, even if your evaluation finds problems. If you believe the system is being built for harmful purposes, should you participate in making it work better?

A Middle Path

Many evaluation professionals navigate this by adopting these practices:

  • Evaluate only when you can publish, or at least disclose, the scope and limitations of your findings
  • State explicitly that a technical evaluation is not an endorsement of the application
  • Frame findings around documented risks and failure modes, so the evaluation informs oversight rather than providing assurance
  • Decline engagements where your involvement would serve mainly as false assurance, or where there is no intent to mitigate the risks you find

The Rater Ethics Code

If you work as a rater in evaluation studies, you have specific ethical rights and organizations have obligations toward you. This rater ethics code outlines both.

Fair Treatment

Raters deserve clear explanations of what they're evaluating, why it matters, and how their judgments are used. Before raters begin, organizations should explain the evaluation purpose, how results will be reported, and who has access to individual rater judgments. Raters should understand whether they're evaluating a public system, a private company's product, a research project, etc.

Fair Compensation

Raters deserve compensation that reflects the cognitive effort and specialized knowledge their work requires. Annotation work is often underpaid, with organizations expecting experts to work at minimum wage or even as volunteers. This is unethical. If you're organizing evaluation work, budget adequately for rater compensation. If you're a rater, negotiate for fair payment. Your expertise has value.

Psychological Safety

Some evaluation tasks require raters to engage with disturbing content: violent imagery, hate speech, graphic descriptions, etc. Organizations should not require raters to evaluate disturbing content without support, psychological safety protocols, and the right to refuse. Raters should have access to mental health support, reasonable limits on disturbing content exposure, and acknowledgment that the work can be psychologically taxing.

Right to Refuse

Raters should have the right to refuse to evaluate certain content without penalty. If a rater encounters content that violates their values or exceeds their psychological capacity, they can decline without career consequences. This means systems should have alternatives so raters can opt out of specific evaluation tasks.

No Coercive Scenarios

Organizations should not pressure raters into evaluating content by making refusal costly (e.g., losing the job, losing income, reducing hours). Refusal should be genuinely optional. This is particularly important for evaluation of AI systems trained to manipulate, deceive, or harm. Raters should not be pressured to engage deeply with such systems.

Transparency About Use

Raters deserve to know how their evaluations are used. Are results published? Are raters credited? Could raters' identities be determined from published results? Organizations should be transparent about these uses and protect rater anonymity when appropriate.

Evaluating AI Ethics Claims: How to Audit When Companies Claim Ethical Practices

Companies increasingly claim their AI systems are ethical, fair, transparent, accountable, or aligned with human values. Evaluating whether these claims are valid requires what I'll call "ethics claims evaluation"—assessing whether a system actually delivers on its stated ethical principles.

Red Flags in Ethics Claims

Vague language: "Our system is designed with ethics in mind" is vague. What specific ethical principles? What design choices operationalize them? Specific claims are more evaluable than vague ones.

No external validation: If the company's ethics claims are based only on internal evaluation, they're unvalidated. Has the system been audited by external researchers? Has the evaluation methodology been peer-reviewed?

No trade-off discussion: Ethical system design involves tradeoffs. A system that optimizes for privacy might sacrifice transparency. A system that optimizes for accuracy might increase disparate impact. Legitimate ethics claims acknowledge tradeoffs and explain how they're balanced. Claims that suggest no tradeoffs ("Our system is both fair and accurate") are suspicious.

No evaluation results published: If a company claims ethical practices but doesn't publish evaluation results, the claims are unverifiable. Real evaluation evidence should be publicly available.

Evaluation Questions to Ask

When auditing ethics claims, ask:

  • What specific ethical principles does the company claim, and what design choices operationalize them?
  • Who validated the claims? Internal teams only, or external auditors and peer reviewers?
  • What trade-offs were made, and how does the company justify the balance it chose?
  • Where are the evaluation results published, and can they be independently verified?

Example: Auditing a "Fairness" Claim

A company claims their hiring AI is fair. To audit this:

Define fairness precisely: The company should specify what fairness means (equal selection rates across groups? Equal false positive rates? Representation of protected groups?). Several common fairness definitions are mutually incompatible; the company cannot optimize for all of them simultaneously.

Examine the evaluation scope: Did they test only demographic groups mentioned in regulations? Or did they test intersectional groups (e.g., women of color vs. men of color vs. white women vs. white men)? Did they test across job categories?

Examine comparison systems: Is this AI fairer than the status quo (human hiring managers)? Than other AI systems? Relative comparisons matter.

Check for cherry-picking: Did they test one metric and report the one where they look best? Ask to see all fairness metrics tested.
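A minimal defense against metric cherry-picking is to compute every candidate fairness metric on the same data. The toy numbers below are hypothetical, but they show how a system can look fair on one metric (selection rate) and unfair on another (false positive rate):

```python
def rates(selected, qualified):
    """Per-group selection rate and false positive rate.
    selected/qualified are parallel 0/1 lists, one entry per applicant."""
    n = len(selected)
    selection_rate = sum(selected) / n
    unqualified = [i for i in range(n) if qualified[i] == 0]
    fpr = (sum(selected[i] for i in unqualified) / len(unqualified)
           if unqualified else 0.0)
    return selection_rate, fpr

# Hypothetical outcomes for two groups of ten applicants each
group_a_sel  = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
group_a_qual = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
group_b_sel  = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
group_b_qual = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]

sr_a, fpr_a = rates(group_a_sel, group_a_qual)
sr_b, fpr_b = rates(group_b_sel, group_b_qual)
print(f"selection-rate gap: {abs(sr_a - sr_b):.2f}")  # equal on this metric
print(f"FPR gap:            {abs(fpr_a - fpr_b):.2f}")  # unequal on this one
```

Both groups are selected at exactly 50%, so a demographic-parity report looks clean, while unqualified applicants in group B are selected twice as often as in group A. Asking to see all metrics tested is how an auditor catches the one that was quietly dropped.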

Creating an Organizational Ethics Review Board for Evaluation

Organizations developing evaluation systems should establish internal ethics review boards similar to Institutional Review Boards (IRBs) in academic research. This ethics board reviews evaluation proposals, identifies ethical concerns, and provides guidance on ethical evaluation practice.

Board Composition

An effective ethics board includes:

  • Members with ethics expertise who can surface concerns evaluators miss
  • Domain experts who understand the systems and evaluation methods under review
  • Representatives of raters and others directly affected by evaluation work
  • Members independent of the teams whose evaluations are being reviewed

Review Process

Before conducting evaluation studies, proposers should submit evaluation plans to the ethics board, including:

  • The evaluation's purpose, methodology, and how results will be reported
  • How raters will be recruited, compensated, and supported
  • Whether raters will be exposed to disturbing content, and what safeguards apply
  • Any conflicts of interest and how they will be disclosed or mitigated

The board reviews these, identifies concerns, and provides conditional approval (if concerns are addressed) or guidance on modifications.

Escalation Paths

The ethics board should have clear escalation paths. If serious ethical concerns are identified, evaluation should be paused while concerns are addressed. The board should have authority to disapprove evaluations that they believe are unethical, even if the evaluation is technically feasible.

Evaluation Ethics Principles

  • Evaluations must be valid and rigorous, not selective or biased toward predetermined conclusions
  • Conflicts of interest must be disclosed; independence must be maintained
  • Evaluation professionals have a duty to report fraud and misconduct
  • Raters deserve fair treatment, compensation, psychological safety, and rights of refusal
  • Claims about ethical systems must be audited with the same rigor as technical claims
  • Organizations should establish ethics review boards overseeing evaluation practice

Commit to evaluation integrity

Ethical evaluation practice isn't optional—it's fundamental to trustworthy AI. As evaluation becomes more central to AI governance, evaluation ethics becomes more important. Build ethical evaluation practices into your organization's DNA.
