Why Legal AI Has the Highest Accuracy Bar

In most AI domains, a 95% accuracy rate is considered excellent. In legal AI, it's barely acceptable. Why? Because a 5% error rate in legal services means one in twenty interactions involves incorrect advice, missed deadlines, misunderstood contracts, or overlooked statutory requirements. The cost isn't a poor user experience. The cost is malpractice, regulatory violations, lost cases, and direct financial harm to clients.

Legal AI errors have consequences that most other domains don't face: sanctions, malpractice liability, waived privilege, discovery violations, and disciplinary action.

This means legal AI evaluation requires stricter standards, more specialized expertise, and greater attention to error cost distribution. A 2% error rate isn't acceptable if those errors cluster in high-stakes scenarios.

Legal AI spans several distinct applications, each with different evaluation requirements:

1. Contract Review and Risk Analysis

What it does: Analyzes contracts (NDAs, employment agreements, service agreements, licensing deals, M&A documents) to identify risks, missing clauses, non-standard terms, and compliance gaps.

Evaluation stakes: Very high. A missed clause that creates unlimited liability or ongoing royalty obligations translates directly into financial loss.

Key eval dimensions:

2. Legal Research and Citation Verification

What it does: Searches legal databases, identifies relevant case law and statutes, provides citations and case holdings.

Evaluation stakes: Extremely high. A citation to a case that doesn't exist or that's been overruled is worse than no citation.

Key eval dimensions:

3. Document Review and eDiscovery

What it does: Categorizes documents as responsive/non-responsive, identifies privileged documents, classifies documents by type and relevance.

Evaluation stakes: Extremely high. Failing to identify a privileged document can waive privilege; misclassifying a document can lead to producing material that should have been withheld, or withholding material that was required to be produced.

Key eval dimensions:

4. Brief and Motion Drafting

What it does: Generates motion text, brief sections, legal arguments based on facts and case law provided.

Evaluation stakes: High. Bad briefs lose motions and cases. Bad arguments damage credibility with the court.

Key eval dimensions:

5. Compliance and Regulatory Analysis

What it does: Analyzes regulations, identifies compliance gaps, maps compliance requirements to specific controls or policies.

Evaluation stakes: High. Missed compliance requirements create regulatory risk and penalties.

Key eval dimensions:

6. Predictive Case Analytics

What it does: Predicts case outcomes based on historical case law, identifies factors that correlate with wins/losses, estimates probability of success.

Evaluation stakes: Medium to high. Bad predictions lead to bad case strategy, settlement decisions, and resource allocation.

Key eval dimensions:

Critical Point

Each of these applications has different evaluation requirements. You cannot use the same eval framework for contract review, legal research, and eDiscovery. The error types, costs, and verification processes are completely different.

Legal research AI is uniquely dangerous because it produces citations that *sound* authoritative even when they're false. This is the "Mata v. Avianca" problem made systemic: an AI system that confidently cites non-existent cases.

The Core Eval Problem

An attorney cannot practically verify every citation in a legal research AI's output. If the AI cites a case, the attorney tends to assume it's real and builds legal strategy on it. A hallucinated case can take weeks to surface, or may never surface if opposing counsel doesn't check.

Evaluation must focus on:

1. Citation Accuracy: Real vs. Hallucinated

Every citation the system produces must be verified against Westlaw/LexisNexis or court databases.

Verification protocol:

  1. Take a random sample of 200 citations from the AI system output
  2. For each citation, verify in Westlaw, LexisNexis, or Google Scholar that:
    • The case/statute name is spelled correctly
    • The citation format is correct (e.g., 123 F.2d 456, not 123 F.2d 4567)
    • The holding described by the AI matches the actual holding
    • The case has not been overruled or reversed
  3. Calculate: Citation accuracy = (citations that exist and are correctly characterized) / (total citations sampled)
  4. Threshold: 99%+ for deployment; 97-99% only with mandatory attorney verification of every citation; below 97%, do not deploy.
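The tally in steps 3-4 can be sketched as a small script (a minimal sketch: the `VerifiedCitation` record and the sample results are hypothetical placeholders for manual verification against Westlaw or LexisNexis):

```python
from dataclasses import dataclass

@dataclass
class VerifiedCitation:
    # Outcome of manual verification in Westlaw, LexisNexis, or Google Scholar.
    exists: bool            # the cited case or statute is real
    holding_matches: bool   # the AI's description matches the actual holding
    still_good_law: bool    # not overruled or reversed

def citation_accuracy(sample: list) -> float:
    """Fraction of sampled citations that exist and are correctly characterized."""
    correct = sum(
        1 for c in sample
        if c.exists and c.holding_matches and c.still_good_law
    )
    return correct / len(sample)

# Hypothetical verification results for a five-citation sample.
sample = [
    VerifiedCitation(True, True, True),
    VerifiedCitation(True, True, True),
    VerifiedCitation(True, False, True),    # real case, misstated holding
    VerifiedCitation(False, False, False),  # hallucinated citation
    VerifiedCitation(True, True, True),
]

accuracy = citation_accuracy(sample)   # 3/5 = 0.6
deployable = accuracy >= 0.99          # threshold from step 4
```

Note that a misstated holding counts against accuracy even when the case is real; both failure modes matter.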

Expected error rate: Even state-of-the-art systems show a 1-3% hallucination rate on citations. This is unacceptable for legal research.

2. Relevance: Does This Case Actually Support the Point?

The system cites a real case, but does it actually support the argument the attorney is making?

Verification protocol:

  1. For each citation in the sample:
    • Read the actual case (or headnotes)
    • Rate relevance on a 4-point scale:
      • Directly on point (case directly supports the argument)
      • Factually analogous (similar facts, principle applies)
      • Tangentially related (relates to the legal principle but not the specific argument)
      • Irrelevant (case doesn't support the argument at all)
  2. Calculate: Relevance score = (Directly on Point + Factually Analogous) / Total
  3. Threshold: 85%+ for deployment
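The relevance score in step 2 reduces to a simple ratio over the four-point scale (sketch only; the rating labels and sample values are hypothetical):

```python
from collections import Counter

# Four-point relevance scale from the protocol above.
ON_POINT = "directly_on_point"
ANALOGOUS = "factually_analogous"
TANGENTIAL = "tangentially_related"
IRRELEVANT = "irrelevant"

def relevance_score(ratings: list) -> float:
    """Share of citations rated directly on point or factually analogous."""
    counts = Counter(ratings)
    return (counts[ON_POINT] + counts[ANALOGOUS]) / len(ratings)

# Hypothetical attorney ratings for a ten-citation sample.
ratings = [ON_POINT] * 6 + [ANALOGOUS] * 3 + [TANGENTIAL]
score = relevance_score(ratings)  # (6 + 3) / 10 = 0.9, above the 85% bar
```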

3. Statutory Currency: Is the Law Current?

A law that's been amended, repealed, or superseded is no longer good law.

Verification protocol:

  1. For statutes cited, verify:
    • Is this statute still in effect?
    • Has it been amended since the AI's knowledge cutoff?
    • Has it been repealed?
    • Are there any pending amendments?
  2. Calculate: Currency accuracy = (statutes that are current + statutes correctly flagged as amended, repealed, or pending amendment) / (total statutes cited)
  3. Threshold: 99%+

4. Jurisdictional Appropriateness: Right Law for the Right Place

The system must cite law appropriate to the jurisdiction. NY law doesn't govern a California dispute unless a choice-of-law clause or conflict-of-laws analysis makes it applicable.

Verification protocol:

  1. Create eval cases with multi-jurisdiction scenarios:
    • "Evaluate NY contract law" → Does it cite NY cases, not CA?
    • "Research federal patent law" → Does it cite federal circuit cases?
    • "Tax treatment in UK" → Does it cite UK law, not US?
  2. Measure: Jurisdictional accuracy = (Cases from correct jurisdiction) / Total cited
  3. Threshold: 98%+
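The jurisdictional accuracy measure can be computed per cited case across scenarios (a sketch; the helper and scenario data are hypothetical):

```python
def jurisdictional_accuracy(results: list) -> float:
    """results: (expected_jurisdiction, jurisdictions of cited cases) per query.
    Each cited case counts as correct if it comes from the expected jurisdiction."""
    total = correct = 0
    for expected, cited in results:
        for jurisdiction in cited:
            total += 1
            correct += jurisdiction == expected
    return correct / total

# Hypothetical eval run over three multi-jurisdiction scenarios.
results = [
    ("NY", ["NY", "NY", "CA"]),        # one stray California cite
    ("Fed. Cir.", ["Fed. Cir."] * 4),
    ("UK", ["UK", "UK", "UK"]),
]
accuracy = jurisdictional_accuracy(results)  # 9/10 = 0.9, below the 98% bar
```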

Legal Research AI Eval Metrics

  • Citation Accuracy: % of citations that exist and are correctly quoted (99%+ required)
  • Hallucination Rate: % of citations that are hallucinated (target 0%; above 0.5% is unacceptable)
  • Relevance: % of citations directly on point or factually analogous (85%+ required)
  • Currency: % of statutes cited that are current law (99%+ required)
  • Jurisdiction Accuracy: % of citations from correct jurisdiction (98%+ required)
  • Completeness: % of key cases/statutes the AI found (vs. manual research baseline)

Contract Review AI Evaluation: Clause Identification and Risk Scoring

Contract review AI is evaluated on two dimensions: does it find all important clauses, and does it correctly assess their risk level?

1. Clause Identification Accuracy

The AI must identify all relevant clauses in the contract, including non-standard clauses that attorneys might miss.

Evaluation protocol:

  1. Create a ground-truth clause inventory:
    • Have a partner-level attorney review the contract and identify all clauses
    • Include standard clauses (indemnification, termination, renewal) and non-standard variations
    • Create a complete list of expected clauses
  2. Run the AI system: Have it identify and extract all clauses
  3. Calculate precision and recall:
    • Precision = (AI-identified clauses that match ground truth) / (Total clauses AI identified)
    • Recall = (AI-identified clauses that match ground truth) / (Total ground truth clauses)
  4. Threshold: 95%+ recall (missing clauses is worse than false positives), 90%+ precision
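Precision and recall over clause inventories can be computed with plain set operations (sketch; the clause names are hypothetical placeholders for extracted clause spans):

```python
def clause_precision_recall(ai_found: set, ground_truth: set) -> tuple:
    """Precision and recall of AI-identified clauses against the attorney inventory."""
    matched = ai_found & ground_truth
    precision = len(matched) / len(ai_found)
    recall = len(matched) / len(ground_truth)
    return precision, recall

# Hypothetical inventories: attorney ground truth vs. AI output.
ground_truth = {"indemnification", "termination", "renewal",
                "non-compete", "liability_cap"}
ai_found = {"indemnification", "termination", "renewal",
            "liability_cap", "governing_law"}

precision, recall = clause_precision_recall(ai_found, ground_truth)
# precision = 4/5 = 0.8, recall = 4/5 = 0.8: below both thresholds,
# and the missed non-compete clause is the costlier error.
```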

2. Risk Scoring Calibration

The AI rates each clause as low/medium/high risk. This scoring must be calibrated to attorney judgment.

Evaluation protocol:

  1. Create a ground-truth risk assessment:
    • Have experienced contract attorneys rate each clause on a 5-point risk scale
    • Include context: contract type, industry, deal size
    • Record reasoning for each rating
  2. Run AI system and compare:
    • Calculate agreement between AI and attorney on risk level
    • Calculate Cohen's kappa (inter-rater reliability)
  3. Cost-of-error analysis:
    • A false negative (AI says low-risk, attorney says high-risk) is very expensive
    • A false positive (AI says high-risk, attorney says low-risk) is just wasted attorney time
    • Weight evaluation toward false negative reduction
  4. Threshold: Kappa of 0.75+ (substantial agreement)
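Cohen's kappa for the AI-vs-attorney comparison can be computed directly from the two rating lists (a minimal, unweighted sketch; the sample ratings are invented):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Unweighted Cohen's kappa between two lists of categorical ratings."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(
        counts_a[label] * counts_b[label] for label in counts_a | counts_b
    ) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical risk ratings for ten clauses (AI vs. attorney).
ai       = ["low", "low", "high", "med", "low", "high", "med", "low", "high", "med"]
attorney = ["low", "low", "high", "med", "med", "high", "med", "low", "high", "high"]
kappa = cohens_kappa(ai, attorney)  # about 0.70, just under the 0.75 threshold
```

A weighted kappa (penalizing low-vs-high disagreement more than low-vs-medium) would fit the cost-of-error analysis in step 3 even better.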

3. Redline Quality

If the system generates proposed redlines or revisions, evaluate these separately:

Evaluation protocol:

  1. Have the system generate proposed contract revisions to address flagged risks
  2. Have attorney reviewers rate each revision:
    • Does it effectively address the risk?
    • Is it market-standard language?
    • Does it create new problems?
  3. Calculate: Quality = (Revisions rated as good/acceptable) / Total revisions
  4. Threshold: 80%+ (attorney still needs to review)

Attorney-Client Privilege Boundaries: The Unique Legal Risk

In eDiscovery and document review, AI systems must recognize attorney-client privilege and treat privileged documents accordingly. An AI that fails to identify privilege or improperly processes privileged documents creates liability that exceeds any other AI error.

The Privilege Problem

Attorney-client privilege protects confidential communications between an attorney and their client made for the purpose of seeking or providing legal advice. Once privilege is waived (by disclosure to a third party), it generally cannot be reclaimed. An AI that incorrectly classifies a privileged document as non-privileged and queues it for production in discovery can destroy attorney-client privilege permanently.

Evaluation Dimensions

1. Privilege Recognition Accuracy

Create test scenarios:

Evaluation protocol:

  1. Create a dataset of 200+ documents with known privilege status
  2. Have the AI classify each as privileged/non-privileged
  3. Calculate accuracy separately for:
    • False negatives (privileged document marked non-privileged) - cost is catastrophic
    • False positives (non-privileged document marked privileged) - cost is lower
  4. Threshold: 99%+ recall on true privileged documents (fewer than 1 in 100 privileged documents miscategorized)

Jurisdiction and Governing Law: The Multi-Jurisdictional Challenge

Modern AI legal systems must handle multi-jurisdictional scenarios correctly. A single contract might involve NY law, California consumer law (if a California user is involved), and federal law simultaneously.

Evaluation Dimensions

1. Jurisdiction Identification

The AI must identify which jurisdiction's law applies, including:

2. Jurisdiction-Specific Analysis

Different jurisdictions have different rules. NY corporate law differs from Delaware corporate law. California employment law differs from Texas employment law.

Evaluation protocol:

  1. Create multi-jurisdiction scenarios:
    • Same contract language, governed by NY law vs. CA law
    • Verify the AI gives different advice for each jurisdiction
  2. Measure: Jurisdiction-specific accuracy = (Correct jurisdiction analysis) / (Total scenarios)
  3. Threshold: 95%+

3. Cross-Border Scenarios

If the AI handles international law, verify it distinguishes between similar but different legal systems (UK law vs. US law, EU law vs. US law).

Important

For multi-jurisdictional systems, you should have evaluators with expertise in each relevant jurisdiction. A single generalist attorney cannot accurately evaluate a system's handling of NY, CA, TX, Delaware, and UK law simultaneously.

Legal Hallucination: The Mata v. Avianca Problem

In May 2023, in Mata v. Avianca, a law firm submitted a brief citing cases that didn't exist, including "Varghese v. China Southern Airlines." The firm had used ChatGPT for legal research, and the model had confidently hallucinated the cases. The judge sanctioned the attorneys, and "Mata v. Avianca" became the canonical example of legal AI hallucination risk.

Why Legal Hallucination is Uniquely Dangerous

In other domains, hallucination is a quality issue. In legal AI, hallucination is a professional liability issue. An attorney who relies on AI and cites a hallucinated case faces sanctions, malpractice liability, and disciplinary action.

Detection Methodology

1. Systematic Hallucination Testing

Protocol:

  1. Create bait cases: Design test queries that would tempt the system to hallucinate similar-sounding cases
    • "Find cases about product liability in aviation" (similar to Mata v. Avianca)
    • "Precedents on non-compete agreements in California" (real topic, but specific cases don't exist)
  2. Measure hallucination rate: % of results that are hallucinated vs. real
  3. Threshold: 0% hallucinated cases is the deployment target; up to 0.1% is borderline acceptable; above 0.5% is unacceptable.
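The bait-case measurement reduces to checking each cited authority against a verification source (sketch only: the `KNOWN_REAL` set and all citation strings are invented placeholders; a real system would verify against Westlaw, LexisNexis, or a court database):

```python
# Hypothetical verification source: a set of citations known to be real.
KNOWN_REAL = {
    "Smith v. Jones, 100 F.3d 1",
    "Doe v. Roe, 200 F.3d 2",
}

def hallucination_rate(cited: list) -> float:
    """Share of cited authorities that cannot be verified as real."""
    return sum(c not in KNOWN_REAL for c in cited) / len(cited)

# Output of a hypothetical bait query: one fabricated case slipped in.
cited = [
    "Smith v. Jones, 100 F.3d 1",
    "Doe v. Roe, 200 F.3d 2",
    "Garcia v. Acme Air, 999 F.3d 999",  # fabricated
]
rate = hallucination_rate(cited)  # 1/3
deployable = rate == 0.0          # zero tolerance per the threshold above
```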

2. Confidence Calibration Testing

A system that says "I'm not sure about this citation" is safer than one that confidently asserts a hallucinated case.

Protocol:

  1. Identify hallucinated vs. real citations in your test set
  2. Measure the system's confidence score for each
  3. Calculate: Are hallucinated citations marked with lower confidence? (They should be.)
  4. If confidence is high on hallucinated cases, this is a serious red flag.
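A crude version of this check compares mean confidence on real vs. hallucinated citations (sketch; the confidence scores are hypothetical):

```python
def mean(xs: list) -> float:
    return sum(xs) / len(xs)

def confidence_gap(real_conf: list, fake_conf: list) -> float:
    """Mean confidence on real citations minus mean confidence on hallucinated
    ones. A well-calibrated system shows a large positive gap; a gap near zero
    (or negative) means confidence cannot be used as a safety signal."""
    return mean(real_conf) - mean(fake_conf)

# Hypothetical confidence scores from a test set with known ground truth.
real_conf = [0.95, 0.90, 0.92, 0.88]
fake_conf = [0.91, 0.89]  # high confidence on hallucinations: a red flag
gap = confidence_gap(real_conf, fake_conf)  # 0.9125 - 0.90 = 0.0125
```

A gap this small means the system asserts fabricated cases almost as confidently as real ones, the serious red flag described in step 4.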

3. Hallucination Type Classification

Not all hallucinations are equal:

Evaluation should weight these differently. Reference figures:

  • 2.1%: hallucination rate typical in first-generation legal research AI
  • 0.3%: hallucination rate in well-trained legal research systems
  • 99%: accuracy threshold for deployment in legal research
  • 1 in 200: citation error frequency that triggers ethical issues

Human Expert Rater Protocol for Legal AI

Legal AI evaluation requires specialized human raters. You cannot use general annotators for legal evaluation.

Rater Qualifications

1. Minimum Qualifications

2. Specialized Requirements by Domain

3. Inter-Rater Reliability Requirements

You must have multiple raters evaluate the same items and measure agreement:

If raters don't agree with each other, their agreement with the AI is meaningless. Measure and enforce inter-rater reliability.

4. Rater Training and Calibration

Create rater training materials:

  1. Written rubric for each evaluation task (explicit criteria)
  2. Examples of each rating level with explanation
  3. Practice set with feedback (10-15 items, with expert-provided ratings)
  4. Retest on practice set before live evaluation (must achieve 85%+ agreement)
  5. Periodic calibration meetings (every 50-100 items, discuss difficult cases)

Evaluating Legal AI for Regulatory Compliance: ABA Model Rules

Attorneys using AI must comply with their state bar's professional responsibility rules. The ABA Model Rules (which most states follow with variations) include several rules that apply to AI use:

ABA Model Rule 1.1: Competence

An attorney must provide competent representation, which includes keeping abreast of changes in law and legal practice, including benefits and risks of relevant technology.

Compliance eval question: Does the AI system's performance meet professional competence standards in the relevant practice area?

How to measure:

ABA Model Rule 1.4: Communication

An attorney must keep the client reasonably informed about the representation and must explain matters to the extent necessary for the client to make informed decisions.

Compliance eval question: Does the AI system's output provide sufficient transparency for attorney-client communication?

How to measure:

ABA Model Rule 1.6: Confidentiality

An attorney must not reveal information relating to client representation without informed consent, with limited exceptions.

Compliance eval question: Does the AI system protect client confidentiality?

How to measure:

ABA Model Rule 1.8: Conflicts of Interest

An attorney must not permit a conflict of interest without informed consent.

Compliance eval question: Could the AI system create conflicts of interest?

How to measure:

ABA Ethics Compliance Checklist

  • Rule 1.1 (Competence): AI output meets professional standard of care for the practice area
  • Rule 1.4 (Communication): Attorney can understand and explain AI reasoning to client
  • Rule 1.6 (Confidentiality): Client data and privileged information are protected
  • Rule 1.8 (Conflicts): AI doesn't create conflicts or use privileged info from competing interests
  • Rule 3.3 (Candor): AI output is truthful and doesn't misrepresent facts or law
  • Rule 8.4 (Misconduct): AI use doesn't constitute fraud, deceit, or dishonesty

Case Study: Document Review AI in eDiscovery (Privilege and Responsiveness)

Document review AI in litigation is a canonical high-stakes legal AI use case. The AI must classify tens of thousands of documents as responsive or non-responsive, and as privileged or non-privileged.

Errors in any of these categories create liability. Failing to produce a responsive document results in discovery violations. Producing a privileged document waives privilege.

Evaluation Protocol

1. Gold Standard Dataset Creation

Step 1: Assemble a representative sample of documents from the litigation dataset

Step 2: Have experienced eDiscovery attorneys classify each document

Step 3: Document the ground truth for each classification:

2. AI System Evaluation

Run the AI system on the gold standard dataset

  1. Have the AI classify each document as:
    • Responsive or non-responsive
    • Privileged or non-privileged
    • Confidence score for each classification
  2. Compare AI classification to ground truth
  3. Calculate metrics separately for responsiveness and privilege:
    • Accuracy (overall correctness)
    • Precision (% of positive classifications that are correct)
    • Recall (% of actual positives that the system found)
    • F1 score (harmonic mean of precision and recall)

3. Error Cost Asymmetry

Not all errors are equal:

Error types, costs, and penalties:

  • Privilege missed (AI says non-privileged, actually privileged): catastrophic. Privilege is waived and the opposing party obtains privileged material.
  • Responsive missed (AI says non-responsive, actually responsive): very high. Discovery violation, sanctions, possible case dismissal.
  • False positive privilege (AI says privileged, actually non-privileged): medium. Over-withholding; the opposing party can move to compel production.
  • False positive responsive (AI says responsive, actually non-responsive): low. Over-production, though the document may not contain sensitive material.

Evaluation standard: Weight the evaluation heavily toward avoiding missed privilege and missed responsive documents. False positives are more acceptable than false negatives.
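One way to operationalize this weighting is a cost-weighted error score (sketch; the penalty weights are hypothetical and should be set by the litigation team):

```python
# Hypothetical per-error penalty weights reflecting the asymmetry above.
COSTS = {
    "privilege_missed": 1000.0,        # catastrophic: privilege waived
    "responsive_missed": 100.0,        # very high: discovery violation
    "privilege_false_positive": 5.0,   # medium: over-withholding
    "responsive_false_positive": 1.0,  # low: over-production
}

def weighted_error_cost(error_counts: dict) -> float:
    """Total cost-weighted error score; lower is better."""
    return sum(COSTS[kind] * n for kind, n in error_counts.items())

# Two hypothetical systems with identical raw error counts (four each).
system_a = {"privilege_missed": 1, "responsive_false_positive": 3}
system_b = {"privilege_false_positive": 2, "responsive_false_positive": 2}
# weighted_error_cost(system_a) = 1003.0; weighted_error_cost(system_b) = 12.0
```

Raw accuracy would rank these systems as equal; the weighted score makes the difference in legal risk explicit.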

4. Statistical Sampling for Validation

You cannot review every document the AI classifies (there might be 100,000+). Use statistical sampling to validate the AI's output on the full dataset:

Protocol:

  1. Run the AI system on the full document set
  2. For each category (privileged, responsive, non-responsive), sample documents:
    • Sample size: 100 documents per category (or 5% of the category, whichever is larger)
    • Sampling method: Random or stratified random (by document type)
  3. Have experienced attorney review the sample
  4. Calculate sampling-based accuracy for each category
  5. Estimate confidence intervals using binomial distribution

Confidence level for eDiscovery:

5. Cost-Benefit Analysis

Document review is labor-intensive. AI can reduce costs significantly, but must maintain accuracy.

Calculation:

Net savings = (cost of full manual attorney review) - (AI processing cost + cost of attorney review of the validation sample)

Example:
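A sketch with hypothetical numbers (document counts, per-document costs, and the sample size are all invented for illustration):

```python
def net_savings(n_docs: int, attorney_cost_per_doc: float,
                ai_cost_per_doc: float, sample_size: int) -> float:
    """Cost of full manual review minus the cost of AI review plus
    attorney validation of a statistical sample."""
    full_manual = n_docs * attorney_cost_per_doc
    ai_plus_sampling = (n_docs * ai_cost_per_doc
                        + sample_size * attorney_cost_per_doc)
    return full_manual - ai_plus_sampling

# Hypothetical matter: 100,000 documents, $4.00/doc attorney review,
# $0.25/doc AI processing, 400 sampled documents re-reviewed by attorneys.
savings = net_savings(100_000, 4.00, 0.25, 400)
# 400,000 - (25,000 + 1,600) = 373,400.0
```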

Best Practice

The best eDiscovery AI implementations combine system efficiency with attorney-level sampling validation. You get the cost savings of AI with the risk management of attorney review on a statistical sample.

Legal AI Evaluation Summary

Legal AI evaluation differs fundamentally from general AI evaluation because errors have real legal and professional consequences. An attorney who relies on AI and that AI fails can face malpractice liability, disciplinary action, and sanctions.

Core evaluation principles for legal AI:

  1. Higher accuracy bar: 95%+ is not sufficient. 99%+ required for citation accuracy, privilege identification, and legal interpretation.
  2. Domain-specific expertise: Evaluators must be licensed attorneys with relevant practice experience. General raters are insufficient.
  3. Hallucination is unacceptable: Zero tolerance for fabricated cases, misrepresented holdings, or obsolete law marked as current.
  4. Privilege is non-negotiable: Privilege identification accuracy requires 99%+ recall. Missing even one privileged document is catastrophic.
  5. Error cost asymmetry: False negatives (missed risks, missed privilege) are far more expensive than false positives (unnecessary flagging). Weight evaluation accordingly.
  6. Multi-jurisdiction complexity: If the system operates in multiple jurisdictions, evaluate jurisdiction-specific performance separately. One jurisdiction's law is not a proxy for another's.
  7. Regulatory compliance: Evaluate against ABA Model Rules and applicable state bar rules. Non-compliance is disqualifying.
  8. Statistical validation: Use sampling-based quality assurance for large-scale deployment (eDiscovery, contract review at scale).

Teams that build legal AI correctly invest heavily in specialized human evaluation with licensed attorneys as raters. This is not optional. It's the cost of operating in a domain where errors have legal consequences.