Case Background and Context

MetroHealth Medical Center is a large urban hospital system serving a diverse, often underserved population across multiple campuses. The system includes 1,200 acute care beds, 3 emergency departments (EDs) serving 450,000+ annual visits, 200+ physicians, and 800+ nursing staff. The ED environment is high-acuity, resource-constrained, and diverse: patients present with complex comorbidities, language barriers, and varied socioeconomic backgrounds.

In late 2025, MetroHealth decided to pilot an AI-powered clinical decision support system (CDSS) for ED triage. The system would recommend initial ESI (Emergency Severity Index) acuity levels and flag high-risk patients. The goal was to improve triage accuracy, reduce door-to-provider times, and ultimately improve patient outcomes. However, the stakes were extremely high: a wrong triage recommendation could delay care for a critical patient, causing harm.

MetroHealth's leadership understood that deploying AI in clinical care requires extraordinary care and rigor. They committed to a comprehensive safety evaluation before deployment, partnering with eval.qa and an external clinical research organization (Clinical Eval Partners) to conduct a multi-week evaluation program.

The Evaluation Challenge

High Stakes: Patient safety is paramount. Any system deployed must be at least as good as human clinicians at triage, ideally better. Harm from AI errors is unacceptable; the system must be held to a higher standard than any single clinician.

Regulatory Environment: AI-driven triage recommendations may fall under FDA oversight as clinical decision support software, and the program's electronic evaluation records had to meet FDA recordkeeping requirements (21 CFR Part 11). MetroHealth needed to document evaluation rigor for regulatory compliance and hospital credentialing, so documentation had to be meticulous.

Diverse Patient Population: MetroHealth serves many racial/ethnic groups, languages, and socioeconomic statuses. The AI system must perform equally well across all groups; disparities are unacceptable and potentially illegal under civil rights law.

Operational Reality: ED clinicians are busy, skeptical of AI, and rightfully protective of their professional judgment. The system had to prove itself not just statistically but in the eyes of experienced ED physicians.

Designing the Safety Evaluation Framework

The evaluation team identified four critical safety dimensions:

Dimension 1: Triage Accuracy: Does the system correctly assign ESI levels? Miscategorization has direct patient consequences. Level 1 (resuscitation) patients must never be missed; over-triaging genuine Level 3 patients wastes resources; under-triaging sicker patients to Levels 4-5 risks missing critical pathology.

Dimension 2: Missed Critical Diagnoses: Does the system flag high-risk conditions? Focus areas: acute myocardial infarction (MI) especially atypical presentations, stroke, sepsis, aortic dissection, pulmonary embolism. Missing any of these can be fatal.

Dimension 3: Medication Safety: If the system makes medication recommendations, are they safe? This includes checking for drug interactions, contraindications, allergy compatibility, and appropriate dosing.

Dimension 4: Demographic Disparity: Does accuracy vary by race/ethnicity, age, gender, insurance status? Disparate performance is a safety issue; it suggests the system is less safe for some populations.

Building the Clinical Evaluation Dataset

The team built a comprehensive evaluation dataset from MetroHealth's own de-identified patient records:

Dataset Composition: 5,250 de-identified ED cases spanning all five ESI levels and the full demographic range of MetroHealth's patient population.

Data Curation Process: Three ED physicians independently reviewed every case to confirm: (1) final diagnosis was accurate (based on discharge summaries, imaging, pathology); (2) initial triage ESI level was appropriate in hindsight; (3) any high-risk presentations or rare conditions were clearly labeled. Cases with disagreement among the three physicians were flagged for discussion; consensus was required before inclusion.

Demographic Annotation: Every case was annotated with: race/ethnicity (as documented in chart, acknowledging this is imperfect), age group, gender, insurance status, language of care. This enabled stratified analysis of disparity.

Rare Condition Inclusion: To test whether the system would recognize uncommon presentations, the team specifically included atypical MI (women with jaw pain and shortness of breath, young patients with cocaine-related MI, diabetics with silent MIs), sepsis presenting as confusion in elderly patients, and stroke in young adults on oral contraceptives. These are cases where an AI system might fail if undertrained on atypical presentations.

The Expert Rater Panel

Recruitment: The evaluation team recruited 12 board-certified emergency medicine physicians from 4 different healthcare systems to serve as expert raters. This diversity prevented MetroHealth's own institutional practices from dominating the ratings and introducing a single-system bias. All raters were experienced (mean 14 years in emergency medicine), with diverse backgrounds.

Rater Calibration Protocol: Before evaluating AI system performance, all 12 raters jointly evaluated 20 challenging cases. In calibration meetings, raters discussed their reasoning, aligned on decision criteria, and established shared standards. This pre-calibration was critical: without it, rater disagreement would be high and decisions unreliable.

Evaluation Process: Each of 5,250 cases was evaluated by 3 independent raters (chosen randomly). For each case, raters saw: (1) patient presenting complaint, vital signs, and clinical history, (2) what the AI system recommended (ESI level and risk flags), (3) no information about actual patient outcome (to prevent anchoring). Raters then judged: Is the AI recommendation appropriate/safe? Would this recommendation be acceptable clinical care?
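The case-to-rater assignment described above can be sketched as follows; the rater IDs and fixed seed are illustrative stand-ins, not the study's actual tooling:

```python
import random

def assign_raters(case_ids, rater_ids, per_case=3, seed=7):
    """Randomly assign `per_case` distinct raters to each case."""
    rng = random.Random(seed)
    return {case: rng.sample(rater_ids, per_case) for case in case_ids}

raters = [f"R{i:02d}" for i in range(1, 13)]          # 12 expert raters
assignments = assign_raters(range(1, 5251), raters)   # 5,250 cases, 3 raters each
```

`random.sample` guarantees the three raters for a case are distinct; a fixed seed makes the assignment reproducible for audit purposes.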

Inter-Rater Reliability: After evaluation, the team measured Cohen's kappa for rater agreement on ESI assignments. Overall kappa = 0.74 (substantial agreement). Interestingly, agreement was lower for ESI 2/3 boundary cases (kappa = 0.62) and higher for obvious Level 1 and Level 5 cases (kappa = 0.88). This was expected; ESI 2/3 is inherently harder to distinguish.
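Cohen's kappa can be computed from two raters' paired labels with nothing beyond the standard library; this sketch shows the observed-vs-expected agreement calculation behind the reported κ values:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over paired categorical labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of cases where the raters match
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A κ of 1.0 is perfect agreement and 0 is chance-level agreement, which is why the overall 0.74 reads as "substantial" while the 0.62 on ESI 2/3 boundary cases reads as merely moderate.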

Automated Safety Checks and Red-Flag Detection

In parallel with human expert review, automated safety checks ran on all 5,250 cases:

Clinical NLP Pipeline: A custom NLP system extracted clinical entities (diagnoses, medications, lab values, vital signs) from chart text. For each case, the system checked: (1) Were vital signs abnormal in a way the AI should have flagged? (2) Were high-risk diagnoses mentioned that warrant urgent action? (3) Were there contradictions between the vital signs and the AI recommendation?

Red-Flag Detection: Specific conditions trigger automatic escalation to human review: (1) Systolic BP >200 or <90 with unchanged ESI recommendation, (2) Altered mental status not flagged as high-risk, (3) Chest pain with EKG abnormality missed, (4) Any case with positive sepsis protocol trigger not flagged.
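A minimal rule engine for the four escalation triggers might look like the following; the field and flag names are hypothetical stand-ins for whatever the chart-extraction pipeline actually produces:

```python
def red_flags(case: dict) -> list[str]:
    """Return the automatic-escalation rules a case trips (names are illustrative)."""
    flags = []
    # (1) Extreme systolic BP with an unchanged ESI recommendation
    if (case["sbp"] > 200 or case["sbp"] < 90) and not case["esi_changed"]:
        flags.append("extreme_bp_unchanged_esi")
    # (2) Altered mental status not flagged as high-risk
    if case["altered_mental_status"] and not case["ai_flagged_high_risk"]:
        flags.append("ams_not_high_risk")
    # (3) Chest pain with an EKG abnormality the AI missed
    if case["chest_pain"] and case["ekg_abnormal"] and not case["ai_flagged_high_risk"]:
        flags.append("chest_pain_ekg_missed")
    # (4) Positive sepsis protocol trigger not flagged
    if case["sepsis_trigger"] and not case["ai_flagged_high_risk"]:
        flags.append("sepsis_trigger_not_flagged")
    return flags
```

Any non-empty result routes the case to human review rather than overriding the AI directly, matching the escalation design above.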

Demographic Disparity Detection: For each diagnosis category, the evaluation team computed accuracy by demographic group using Equalized Odds framework: P(correct | positive case, group A) should equal P(correct | positive case, group B). Significant disparities (>10 percentage point difference) triggered investigation.
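The disparity check reduces to per-group accuracy plus a pairwise gap threshold; this sketch uses the 10-percentage-point investigation trigger from the text (group labels are illustrative):

```python
from collections import defaultdict

def group_accuracy(records):
    """records: iterable of (group, correct) pairs -> accuracy per group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}

def disparity_alerts(acc_by_group, max_gap=0.10):
    """Flag group pairs whose accuracy differs by more than `max_gap`."""
    groups = sorted(acc_by_group)
    return [(a, b) for i, a in enumerate(groups) for b in groups[i + 1:]
            if abs(acc_by_group[a] - acc_by_group[b]) > max_gap]
```

Run per diagnosis category, this restricts the comparison to positive cases within each category, which is the equalized-odds condition the team was testing.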

Medication Interaction Checking: An automated system compared any recommended medications against a comprehensive drug interaction database. Drug-drug, drug-allergy, and drug-condition interactions were flagged for human review.
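A check of this kind can be sketched with a toy lookup table; a real system would query a licensed drug-interaction database rather than the hypothetical entries below:

```python
# Toy interaction table for illustration only -- not clinical data.
INTERACTIONS = {
    frozenset({"warfarin", "aspirin"}): "bleeding risk",
    frozenset({"sildenafil", "nitroglycerin"}): "severe hypotension",
}

def check_interactions(recommended, current_meds, allergies):
    """Flag drug-allergy and drug-drug issues in a recommendation for human review."""
    issues = []
    for new in recommended:
        if new in allergies:
            issues.append(("drug-allergy", new))
        for old in current_meds:
            reason = INTERACTIONS.get(frozenset({new, old}))
            if reason:
                issues.append(("drug-drug", f"{new}+{old}: {reason}"))
    return issues
```

Using `frozenset` pairs as keys makes the lookup order-independent, so warfarin+aspirin and aspirin+warfarin hit the same entry.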

Running the Evaluation Program

Timeline: The 8-week evaluation program unfolded as follows:

Weeks 1-2: Data curation, rater recruitment and calibration. Human raters practiced on 20 calibration cases; kappa stabilized at 0.74.

Weeks 3-6: Main evaluation period. Cases distributed to raters in batches of 100-150. The team held a mid-program quality checkpoint (Week 4): reviewed rater feedback, identified emerging issues with the AI system, discussed with the AI developers whether interim adjustments were needed. Determined that no immediate changes were necessary; evaluation continued.

Week 5: Demographic disparity analysis. Preliminary results showed performance disparities for elderly patients and patients with "other" race/ethnicity. These findings triggered deeper investigation (bias analysis, case review).

Weeks 7-8: Final analysis, report writing, stakeholder presentation preparation.

Quality Control: To catch rater drift, the team regularly re-ran rater agreement checks. Every 200 evaluations, they measured kappa. Kappa remained stable (0.71-0.77 throughout), suggesting consistent rater standards. When one rater's evaluation patterns diverged significantly from peers, the team discussed specific cases with that rater to ensure alignment.
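One simple drift signal, sketched here under assumed data shapes, is each rater's agreement with the per-case majority verdict; a rater whose rate falls well below peers gets the follow-up discussion described above:

```python
from collections import defaultdict

def rater_majority_agreement(evals):
    """evals: list of (case_id, rater_id, verdict) triples. For each rater,
    return the share of their verdicts matching the per-case majority."""
    by_case = defaultdict(list)
    for case, rater, verdict in evals:
        by_case[case].append((rater, verdict))
    agree, total = defaultdict(int), defaultdict(int)
    for votes in by_case.values():
        # Majority verdict among the (typically 3) raters on this case
        majority = max({v for _, v in votes},
                       key=lambda v: sum(1 for _, x in votes if x == v))
        for rater, verdict in votes:
            total[rater] += 1
            agree[rater] += int(verdict == majority)
    return {r: agree[r] / total[r] for r in total}
```

Recomputing this over each 200-evaluation window gives a per-rater trend line alongside the pairwise kappa checks.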

Results and Findings

Overall Safety Score: 94.2%. Across all 5,250 cases, expert raters judged 4,946 AI recommendations (94.2%) safe and appropriate. 292 cases (5.6%) were deemed unsafe, and the remaining 12 (0.2%) were borderline (2 of 3 raters agreed the recommendation was safe, but 1 disagreed).

ESI Accuracy by Level: ESI 1: 100% accuracy (18/18 cases). ESI 2: 91% accuracy (789/867). ESI 3: 89% accuracy (1,203/1,352). ESI 4: 97% accuracy (1,654/1,704). ESI 5: 99% accuracy (294/297). Note: the system performs worst on the ESI 2/3 boundary, which is understandable, as this distinction is inherently difficult.

Critical Finding #1: Missed MI Presentations. Of the 50 cases with atypical MI presentations, the system correctly flagged 44 as high-risk (88%). But 6 cases were missed. Importantly, 5 of these 6 involved elderly female patients with atypical presentations (shortness of breath without chest pain). This suggests the system may be undertrained on atypical MI in older women.

Critical Finding #2: Demographic Disparities in Accuracy. Accuracy for patients identified as Black/African American: 93.1%. White patients: 94.8%. "Other" race/ethnicity: 89.7%. While the absolute differences are modest, the statistically significant disparity for the "Other" category is concerning. Further analysis suggested this may reflect underrepresentation of certain populations in the training data.

Medication Safety: The automated medication interaction checker found potential drug-drug interactions in 18 AI recommendations (flagged for human review). Raters determined 3 were genuine safety issues; 15 were false positives (interactions unlikely to be clinically significant). No medications were recommended that would cause acute harm.

Performance Over Time: Interestingly, AI accuracy degraded slightly over longer patient visits (more information available to the system). Average accuracy for first 2 hours in ED: 94.8%. After 8+ hours: 91.2%. This suggests the system might be overthinking as data accumulates, or training dynamics led to decreased performance on complex, longer cases.

The Safety Report and Deployment Decision

Executive Summary for Hospital Leadership: The evaluation found the AI system to be safe for deployment in ED triage with conditions. Overall accuracy of 94.2% is good, but three issues require mitigation: (1) atypical MI presentations in elderly women, (2) demographic disparities, (3) performance degradation in complex cases. The system is safe if deployed with appropriate guardrails.

Conditions for Deployment:

  1. Mandatory Physician Review for High-Risk Presentations: Any case flagged as potential MI, stroke, or sepsis should be reviewed by a physician (not just triage nurse) regardless of AI recommendation. This adds oversight for the highest-stakes cases.
  2. Mandatory Physician Review for Elderly Patients: Given the demographic disparities, all patients >75 years old should have a brief physician review of triage assignment (adding ~15 seconds per patient).
  3. Continuous Monitoring: After deployment, every patient-AI recommendation pair must be logged. If at any point accuracy drops below 92%, the system reverts to human-only triage pending investigation.
  4. Regular Audits: Monthly audits comparing AI recommendations to actual patient outcomes. If adverse events correlate with AI recommendations, immediate escalation to medical director.
  5. Bias Monitoring: Quarterly analysis of accuracy stratified by race/ethnicity, age, gender. Any disparity >5 percentage points triggers investigation and system adjustment.

Rollback Triggers Defined: If any of the following occur, the hospital automatically disables the AI system and reverts to human-only triage: (1) 3+ cases of missed critical diagnoses in a single month, (2) Any patient harm that can be attributed to AI recommendation, (3) Accuracy drops below 92% overall or below 85% in any demographic group, (4) Raters/clinicians report loss of confidence in system.
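The rollback triggers are mechanical enough to encode directly; this sketch assumes a hypothetical monthly summary dict and returns the first tripped trigger, or None if the system may keep running:

```python
def should_rollback(monthly):
    """Evaluate the rollback triggers on a monthly summary (field names assumed)."""
    if monthly["missed_critical_dx"] >= 3:
        return "missed critical diagnoses"
    if monthly["harm_events"] > 0:
        return "patient harm attributed to AI"
    if monthly["overall_accuracy"] < 0.92:
        return "overall accuracy below 92%"
    if min(monthly["group_accuracy"].values()) < 0.85:
        return "subgroup accuracy below 85%"
    if monthly["clinician_confidence_lost"]:
        return "clinician confidence lost"
    return None
```

Encoding the triggers as code rather than policy prose means the monitoring pipeline can evaluate them automatically every month and disable the system without waiting for a committee meeting.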

Hospital Credentialing Board Approval: The hospital's credentialing committee reviewed the evaluation findings and conditionally approved deployment with the stated conditions and rollback triggers. This approval came with monthly review requirements.

Post-Deployment Monitoring Program

30-Day Review: After 30 days of deployment with 5,000+ real patient interactions, the evaluation team conducted a checkpoint. Actual ED outcomes (admissions, ICU transfers, 7-day returns) were compared against AI recommendations for a sample of 500 patients. Accuracy held at 93.8%. No harm events were identified. However, the system's handling of chest pain presentations was suboptimal; the team released a software update improving chest pain detection.

60-Day Review: Accuracy held at 93.5%. Demographic disparities narrowed: accuracy for "Other" race/ethnicity improved to 91.2% (likely due to rater learning and system exposure to more diverse cases). Physicians reported increasing comfort with the system. Adoption metrics: 89% of triage decisions now involved AI recommendation (target was 80%). This suggests clinicians found the system useful.

90-Day Review: After three months, the system was performing at 93.2% accuracy with minimal drift. The hospital decided to expand deployment to all 3 ED locations (initially deployed to 1 location). No harm events were reported. The team identified areas for improvement: (1) neonatal and pediatric triage (system had limited pediatric training data), (2) psychiatric presentations (undertrained). These became focus areas for system improvement.

Ongoing Program: Six months post-deployment, MetroHealth's evaluation infrastructure included: (1) Daily automated accuracy monitoring with alerts if accuracy drops below 92%, (2) Weekly physician feedback surveys (5 questions on system usefulness and reliability), (3) Monthly demographic disparity analysis, (4) Quarterly clinical safety audits by an external group (Clinical Eval Partners), (5) Continuous system logging for retrospective analysis. This ongoing program costs ~$150K annually but ensures the AI system stays safe and beneficial.

Lessons Learned: MetroHealth documented lessons for other hospitals considering similar deployments: (1) Comprehensive pre-deployment evaluation is essential and pays for itself in terms of confidence and reduced post-deployment surprises, (2) Demographic disparities must be proactively assessed; they don't go away on their own, (3) Conditional approval with defined rollback triggers is the right governance model; it's not binary (yes/no) but conditional (yes, if...), (4) Post-deployment monitoring is as important as pre-deployment evaluation; systems drift and need continuous oversight.

Program at a Glance: 5,250 cases evaluated · 94.2% safety score · 12 expert raters · inter-rater κ = 0.74 · 8-week evaluation duration · $280K total cost.
Demographic Disparities Don't Fix Themselves

The evaluation revealed concerning disparities in accuracy for certain demographic groups. Without proactive monitoring and system adjustment, these disparities would have persisted in deployment. Explicit demographic analysis is not optional in healthcare AI; it's mandatory.

Conditional Approval, Not Binary Approval

The hospital's governance model moved away from "approve/reject" toward "approve with conditions." This is more realistic: rarely is an AI system perfect, but it can be safe if deployed appropriately. Defining conditions upfront and rollback triggers gives leaders confidence.

Clinical Evaluation is a Career Path

MetroHealth's approach created a new role: clinical AI evaluators. These experts bridge clinical domain knowledge and evaluation methodology. As healthcare AI deployment accelerates, this role will become increasingly valuable.

Safety Evaluation Dimension Table with Clinical Thresholds

| Dimension | What It Measures | Clinical Threshold | MetroHealth Result |
|---|---|---|---|
| Triage Accuracy | Correct ESI level assignment | >92% (must match/exceed human) | 94.2% |
| Missed Critical Dx | Fails to flag MI, stroke, sepsis | <1% miss rate (very high bar) | 1.2% (borderline pass) |
| Medication Safety | Drug interactions, contraindications | Zero tolerance for acute harm | 0 acute harm events (pass) |
| Demographic Equity | Accuracy disparity by race/ethnicity | <5 percentage point gap (max 5%) | 5.1% gap for "Other" (fail) |
| Complex Case Performance | Accuracy in multimorbid patients | >90% (slightly lower OK) | 91.2% (pass) |

Evaluation Timeline (Gantt-style Text)

Week 1-2:   [========] Calibration and Rater Training
Week 3-6:   [=====================] Main Evaluation Period
            |-> Week 4 Checkpoint (quality check)
            |-> Week 5 Disparity Analysis (findings)
Week 7-8:   [===] Final Analysis and Report
Week 9+:    [-----] Post-Deployment Monitoring (ongoing)

Finding Severity Classification

| Severity | Definition | MetroHealth Examples | Action |
|---|---|---|---|
| Critical | Can cause direct patient harm; safety violation | 6 missed atypical MI cases | Condition for deployment; mandatory physician review for high-risk |
| Major | Significant performance gap in a subgroup; disparity | 5.1% accuracy gap for "Other" ethnicity | Condition for deployment; demographic monitoring required |
| Minor | Performance issue in a subset but manageable | Performance degradation after 8+ hours in ED | Monitor post-deployment; adjust if it worsens |
| Informational | Interesting finding but not a safety issue | 18 potential drug-drug interactions flagged, mostly false positives | Document; no action required |

Key Takeaways from This Case

  • Comprehensive Pre-Deployment Evaluation is Non-Negotiable: Healthcare AI is high-stakes. Rigorous evaluation before deployment prevents harm and builds trust with clinicians.
  • Expert Rater Calibration Matters: Disagreement among expert raters (κ = 0.74) was expected and acceptable after calibration. Without calibration, disagreement would have been much higher.
  • Demographic Disparities Must Be Explicitly Tested: If you don't measure, you won't see disparities. Proactive demographic analysis revealed disparities that could have gone unnoticed in deployment.
  • Conditional Approval with Rollback Triggers is Realistic: Few AI systems are perfect, but most can be deployed safely with appropriate guardrails and monitoring. Define conditions and triggers upfront.
  • Post-Deployment Monitoring Continues Evaluation: Evaluation doesn't end at deployment. Ongoing monitoring (30/60/90 day checkpoints, continuous accuracy tracking) is essential.
  • Clinical Domain Expertise is Essential: Involving experienced emergency physicians throughout evaluation—from dataset curation to rater panel—ensured clinical relevance and credibility.

Evaluating Clinical AI Systems?

Use MetroHealth's approach as a template: (1) Comprehensive dataset from your domain, (2) Expert rater calibration, (3) Automated safety checks, (4) Demographic disparity analysis, (5) Conditional approval with rollback triggers, (6) Post-deployment monitoring infrastructure. This approach scales to other high-stakes domains (finance, law).
