Case Background: Regional Bank, Three AI Models, OCC Examination

MidCity Bank is a regional financial institution with $45 billion in total assets. In 2025, it faced an OCC (Office of the Comptroller of the Currency) examination that included a deep review of its AI systems. Three production models were in scope: (1) Credit scoring for loan origination (deployed 2021, processes 5,000 applications/month), (2) Fraud detection for transaction monitoring (deployed 2020, evaluates 2M transactions/day), (3) Customer service chatbot for support questions (deployed 2023, handles 50K queries/month).

All three models were in production but had never undergone formal validation per OCC requirements. Model risk management (MRM) was ad-hoc: data scientists ran informal tests, but there was no documented validation methodology, no ongoing monitoring framework, and no evidence of third-party validation for high-risk models. The bank's Chief Risk Officer knew this was a problem and wanted to get ahead of the examination.

The Challenge: The bank had 6 months to implement formal evaluation and documentation for three models simultaneously, while maintaining production systems and preparing for the upcoming OCC examination. It had no prior evaluation expertise and needed to build credible evidence of model quality quickly.

Regulatory Pressure: OCC Examination Findings and Requirements

SR 11-7 / OCC Bulletin 2011-12: The supervisory guidance on model risk management (issued by the Federal Reserve as SR 11-7 and adopted by the OCC as Bulletin 2011-12) requires institutions to validate models before deployment and monitor them continuously. Validation must document: model purpose, development data and methodology, validation methodology, performance against acceptance criteria, and approval from the model governance authority. Ongoing monitoring must include: performance tracking, periodic revalidation, and alert triggers for degradation.

Examination Findings: During preliminary discussions with examiners, the bank was told: (1) The credit scoring model (high-risk due to its regulatory implications for lending) needs third-party validation and documented evidence of fairness testing. (2) The fraud model needs evidence of false positive/false negative analysis and operational limits (when it triggers human review). (3) The chatbot needs evidence it's not providing misleading financial advice. All three need an ongoing monitoring framework, not just one-time validation.

Regulatory Clock: Examiners would return in 6 months for formal examination. The bank needed: validated models with documented findings, ongoing monitoring dashboard showing data quality and performance, board-level risk reporting on AI models, and evidence of governance (who approves models, how are issues escalated).

The AI Governance Eval Framework Design: Three-Tier Risk Classification

The bank designed a three-tier framework: High-Risk (credit scoring, requires highest rigor), Medium-Risk (fraud detection, requires standard rigor), Low-Risk (chatbot, requires basic rigor). Each tier had specific eval requirements.

High-Risk Evaluation: (1) Comprehensive performance testing (accuracy, precision, recall overall and by demographic segment), (2) Fairness/disparate impact testing per Equal Credit Opportunity Act (ECOA) guidance, (3) Stability testing (does model perform consistently over time?), (4) Model comparison (is this model better than previous approach?), (5) Third-party validation by independent evaluator, (6) Evidence package for regulators.

Medium-Risk Evaluation: (1) Standard performance metrics (accuracy, precision-recall tradeoff analysis), (2) Segment performance (model performance by transaction type, customer segment), (3) Operational limits (at what confidence threshold does it flag for human review?), (4) Comparison to fraud expert baseline (do experts agree with model?), (5) Cost-benefit analysis (cost of false positives vs. catching fraud).

Low-Risk Evaluation: (1) Basic accuracy testing, (2) Harmful output detection (does it give clearly wrong financial advice?), (3) Content accuracy spot check (is retrieval source valid?), (4) User satisfaction (explicit feedback from support team).

Credit Scoring Model Evaluation: The Comprehensive Analysis

Model Context: The credit scoring model predicts loan default probability (0-100 score). It's used to set interest rates and approve/deny loans. High-risk because: (1) Regulatory: fair lending law requires non-discrimination by protected characteristics (race, gender, etc.), (2) Impact: determines who gets credit and at what price, (3) Criticality: core to bank's profitability.

Evaluation Methodology: The bank's evaluation included: (1) Performance Testing: Accuracy (AUC-ROC 0.78), precision-recall curve, default prediction at different score thresholds. (2) Fairness/Disparate Impact: Using ANOVA and Chi-squared tests, analyzed whether default prediction varied by demographic group (race, gender, age, zip code). Found: model predicted 8% default rate for Black applicants vs. 5.2% for white applicants (disparate impact ratio 1.54, threshold is 1.25). This triggered deeper analysis. (3) Causality Analysis: Used SHAP explainability to understand what features drove the disparity. Found: the model was relying on zip code as a proxy for race (redlining). Zip code itself was a strong feature but was correlated with race. (4) Model Comparison: Evaluated whether removing zip code improved fairness without hurting accuracy. Reimplemented model: accuracy dropped slightly (AUC 0.76 vs. 0.78) but disparate impact improved (ratio 1.12, below threshold). (5) Third-Party Validation: Hired external validation firm to independently replicate evaluation and audit methodology. Validation firm confirmed findings.
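The disparate-impact check in step (2) above reduces to a ratio of group-level mean predictions. A minimal sketch, using toy data that reproduces the case's 8% vs. 5.2% figures; the column names (`race`, `predicted_default`) and reference group are illustrative, not the bank's actual schema:

```python
import pandas as pd

def disparate_impact_ratio(df: pd.DataFrame, group_col: str,
                           outcome_col: str, reference_group: str) -> pd.Series:
    """Ratio of each group's mean predicted default rate to the reference
    group's rate. Ratios above the bank's 1.25 threshold trigger deeper
    causal analysis (e.g. SHAP-based proxy-feature review)."""
    rates = df.groupby(group_col)[outcome_col].mean()
    return rates / rates[reference_group]

# Toy data matching the case: 26/500 = 5.2% vs. 40/500 = 8.0% predicted default.
df = pd.DataFrame({
    "race": ["white"] * 500 + ["black"] * 500,
    "predicted_default": [1] * 26 + [0] * 474 + [1] * 40 + [0] * 460,
})
ratios = disparate_impact_ratio(df, "race", "predicted_default", "white")
print(ratios.round(2))  # black ≈ 1.54, white = 1.00
```

The same function can then be rerun on the retrained model's predictions to confirm the ratio falls below threshold.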

Results and Recommendation: The high-risk eval revealed a problem: the model had implicit bias that violated fair lending requirements. Recommendation: (1) Retrain model excluding zip code, (2) Accept slight accuracy decrease in exchange for fair lending compliance, (3) Implement quarterly fairness monitoring dashboard, (4) Document decision for regulators.

Fraud Detection Model Evaluation: Operational Complexity

Model Context: The fraud detection model flags suspicious transactions for human review. Medium-risk because: (1) Performance is harder to measure (fraud is rare, labels may be incomplete), (2) The cost-benefit tradeoff is critical (false positives create operational burden on investigators), (3) There is no fair-lending-style regulatory requirement as with credit, but there is reputational risk if fraud slips through.

Evaluation Methodology: (1) Performance Metrics: Precision-recall analysis showed: at the current operating point, the model has 92% precision (investigators spend time on real fraud) but 45% recall (catches less than half of fraud). (2) Operational Threshold Optimization: Evaluated different decision thresholds: a lower threshold catches more fraud but increases false positives (operational burden), a higher threshold reduces false positives but misses fraud. Bank's decision: relax the precision target to 90%+, which raised recall to 50% (catch half of fraud while keeping the false positive rate low). (3) Segment Analysis: Performance varies by transaction type: the model is 88% precise on wire transfers (high-value, high-fraud-risk) but only 75% precise on daily card transactions. The model is better at high-value fraud. (4) Comparison to Baseline: The bank compared the model to rule-based fraud detection (the previous approach): the model catches more fraud at a lower false positive rate. (5) Temporal Validation: Tested whether model performance degrades over time (fraud patterns evolve): slight degradation observed over 6 months, suggesting quarterly retraining is needed.
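The threshold optimization in step (2) amounts to maximizing recall subject to a precision floor. A minimal sketch under that framing, with synthetic scores and labels standing in for the bank's data:

```python
import numpy as np

def pick_threshold(scores: np.ndarray, labels: np.ndarray,
                   min_precision: float = 0.90) -> float:
    """Return the lowest score threshold whose precision meets the target.
    Among qualifying thresholds, the lowest one flags the most transactions,
    so it maximizes recall subject to the precision constraint."""
    best = None
    for t in np.unique(scores):
        flagged = scores >= t
        if flagged.sum() == 0:
            continue
        precision = labels[flagged].mean()
        if precision >= min_precision and (best is None or t < best):
            best = t
    return best

# Synthetic data: fraud cases score higher on average than legitimate ones.
rng = np.random.default_rng(0)
fraud = rng.integers(0, 2, 1000)
scores = np.clip(fraud * 0.5 + rng.normal(0.3, 0.2, 1000), 0, 1)
t = pick_threshold(scores, fraud)
```

In production the same search would run on a held-out labeled sample, and the chosen threshold would be documented as an operational limit per the examiners' condition.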

Results and Recommendation: Model is performing well but requires operational guardrails: (1) Adjust threshold to 85% confidence for automated flagging (reduces burden on investigators), (2) Implement quarterly retraining to maintain performance, (3) Daily monitoring dashboard showing false positive rate and fraud catch rate by transaction type, (4) Annual reassessment of model performance.

Customer Service Chatbot Evaluation: Safety and Accuracy Focus

Model Context: Chatbot answers customer questions about account balance, transaction history, and general banking products. Low-risk because: (1) It's not making binding decisions, (2) Humans can override/correct answers, (3) Customers can escalate to human support. But still requires evaluation: users trust bank, wrong advice damages reputation.

Evaluation Methodology: (1) Harmful Output Detection: Tested whether chatbot gives clearly wrong financial advice. Example bad outputs: "yes, you should take this high-interest loan" when customer has poor credit. Ran 200 questions covering edge cases where chatbot might be wrong. Found: 3% of responses were potentially misleading or wrong (factually incorrect statements). (2) Compliance Language Accuracy: Checked whether chatbot correctly explains compliance requirements (fair lending, right to dispute). Spot-checked 50 responses: 2 were non-compliant (didn't mention customer right to dispute). (3) PII Handling: Tested whether chatbot properly handles personally identifiable information. Found: chatbot appropriately redacts SSN, account numbers, etc. (4) User Satisfaction: Collected explicit feedback from support team on chatbot quality. 75% of support staff rated chatbot as "helpful", 15% as "sometimes helpful", 10% as "often wrong".
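Spot checks this small carry real statistical uncertainty: 2 non-compliant responses out of 50 is a point estimate of 4%, but the plausible range is much wider. A Wilson score interval (a standard small-sample formula, not something the case specifies the bank used) makes that explicit:

```python
import math

def wilson_interval(failures: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a proportion; better behaved than the
    normal approximation when failures are rare and n is small."""
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_interval(2, 50)
print(f"{lo:.3f} - {hi:.3f}")  # ≈ 0.011 - 0.135
```

The true non-compliance rate could plausibly be anywhere from about 1% to 13%, which is one argument for the larger quarterly samples adopted later.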

Results and Recommendation: The chatbot is performing acceptably but needs improvements: (1) Retrain on compliance language to eliminate the non-compliant responses found in the spot check (2 of 50), (2) Add explicit guardrails: if a customer asks about specific investment advice, escalate to a human (the chatbot isn't qualified), (3) Quarterly manual sampling of responses to detect new failure modes, (4) User feedback integration: support team flags bad responses, and those become training examples.

Documentation and Evidence Package: What Regulators Review

The bank compiled a Model Risk Management documentation package including:

Model Inventory: For each model: model ID, name, purpose, deployment date, owner, risk tier classification, validation date, next revalidation date.
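One way the inventory fields listed above could be captured as a typed record; the field names mirror the case's list, while the specific values and identifiers (e.g. "MDL-001") are illustrative:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class RiskTier(Enum):
    HIGH = "high"      # e.g. credit scoring
    MEDIUM = "medium"  # e.g. fraud detection
    LOW = "low"        # e.g. chatbot

@dataclass
class ModelRecord:
    """One row of the model inventory regulators review."""
    model_id: str
    name: str
    purpose: str
    deployed: date
    owner: str
    tier: RiskTier
    validated: date
    next_revalidation: date

credit = ModelRecord(
    model_id="MDL-001",
    name="Credit Scoring",
    purpose="Loan origination default probability",
    deployed=date(2021, 1, 1),
    owner="Head of Retail Lending",
    tier=RiskTier.HIGH,
    validated=date(2025, 3, 1),
    next_revalidation=date(2026, 3, 1),  # annual, per the revalidation schedule
)
```

A typed record like this makes the revalidation schedule machine-checkable: a nightly job can flag any model whose next_revalidation date has passed.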

Validation Reports: Comprehensive reports for each model including: (1) Validation methodology (what we tested, why, acceptance criteria), (2) Data used (train/test split, data period, sample size), (3) Detailed results (performance metrics, segment analysis, failure modes), (4) Methodology limitations (this approach doesn't measure X), (5) Conclusion (does model meet acceptance criteria?), (6) Recommendations.

Governance Documentation: Model Risk Governance Committee charter (who meets, when, what decisions), escalation process (when do issues get escalated to leadership?), approval records (who approved each model for deployment?), board reporting (monthly/quarterly AI model risk reports to board).

Ongoing Monitoring Evidence: Monitoring dashboards showing: (1) Model performance over time (accuracy, precision-recall), (2) Data quality metrics (input drift, feature distributions), (3) Alerts triggered and responses, (4) User complaints/escalations, (5) Revalidation schedule and completion.
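For the input-drift metric in item (2), a common choice is the Population Stability Index (PSI) between a feature's training-time and live distributions. A sketch, assuming standard decile binning and the usual rule-of-thumb cutoffs (the case doesn't specify which drift metric the bank used):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline (training) sample and
    a current (live) sample, using the baseline's decile edges as bins."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)           # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Synthetic example: a mild shift in a score-like feature.
rng = np.random.default_rng(7)
train_scores = rng.normal(650, 50, 10_000)
live_scores = rng.normal(640, 55, 10_000)
drift = psi(train_scores, live_scores)
# Rule of thumb: < 0.10 stable, 0.10-0.25 investigate, > 0.25 alert
```

Computed daily per feature, PSI values feed the dashboard directly and give the alert triggers in item (3) a concrete definition.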

Third-Party Validation: Independent validation firm report on credit model methodology, findings, and recommendations.

The Examination Outcome: Findings and Conditions

Credit Scoring Model: Examiners found: bias in the model (disparate impact issue), but bank had already identified and remediated. Finding: "Model requires removal of zip code feature and quarterly fairness monitoring. Bank's remediation plan is acceptable."

Fraud Detection Model: Examiners found: model is acceptable, but operational processes need documentation (what happens when model flags transaction? How are false positives handled?). Finding: "Operationalize the threshold and document the process." Condition: "Implement documented procedures for fraud model alerts and human review."

Chatbot Model: Examiners found: 3% error rate is low but should be monitored. Finding: "Model is low-risk. Implement ongoing monitoring and quarterly sampling."

Overall Assessment: "Bank's AI governance has significant gaps, but the bank has taken appropriate steps to address them. Recommend continuation of quarterly examination of AI governance program. Ensure credit model fairness improvements are implemented and monitored."

Timeline: Examination completed. The bank received "findings" (issues to address) and "conditions" (requirements with specific timelines). All are manageable and aligned with the bank's remediation plan.

Remediation and Ongoing Program: Building Sustainable Evaluation

Credit Model Remediation: Bank retrained credit model excluding zip code, improving fairness ratio to 1.08 (below 1.25 threshold). Deployed new model. Implemented quarterly fairness monitoring dashboard (tracks disparate impact by demographic group monthly, alerts if it increases). Board receives quarterly AI model risk report including fairness metrics.
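The quarterly fairness alert described above reduces to a simple threshold check on the recomputed disparate-impact ratios. A minimal sketch; the data structure is illustrative:

```python
THRESHOLD = 1.25  # the bank's disparate-impact limit

def fairness_alerts(ratios_by_group: dict[str, float],
                    threshold: float = THRESHOLD) -> list[str]:
    """Return the demographic groups whose disparate-impact ratio exceeds
    the threshold in this monitoring period; an empty list means no alert."""
    return [g for g, r in ratios_by_group.items() if r > threshold]

# After retraining without zip code, the ratio sat at 1.08 -- no alert.
print(fairness_alerts({"black": 1.08}))  # []
```

Wiring this check into the dashboard gives the board report a binary signal (alert / no alert) on top of the raw ratios.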

Fraud Model Operationalization: Documented decision threshold (85% confidence for automated flag) and investigation process. Created operational procedures and trained investigators. Implemented daily dashboard showing model performance: precision, recall, false positive rate, fraud catch rate. Alert triggers if precision drops below 85%.

Chatbot Improvement: Retrained chatbot on compliance language using flagged examples from support team. Added guardrails: if customer asks for specific investment advice, bot says "I can't provide investment advice, please speak to a financial advisor" and offers to transfer to human. Implemented quarterly sampling: support team reviews 50 random bot responses and flags errors. Errors become training examples for next retraining cycle.

Ongoing Monitoring Program: Bank established: (1) Model Risk Governance Committee (quarterly meetings, includes Risk Officer, CIO, heads of business units), (2) Monitoring dashboards (updated daily, shared with committee), (3) Revalidation schedule (credit model annually, fraud model semi-annually, chatbot annually), (4) Board reporting (quarterly AI model risk report, includes performance metrics, issues, recommendations).

Lessons Learned: Key Takeaways for Financial Services AI Governance

Documentation First: Regulators care as much about documentation as they do about results. MidCity Bank's evaluation only became credible when they documented methodology, findings, limitations, and recommendations in formal reports. "We think our model is good" doesn't work; "Here's how we validated it, here are the results, here's our limitation analysis, here's what we're doing about issues" does.

Ongoing Monitoring Non-Negotiable: One-time validation isn't enough; regulators require continuous monitoring. The bank's decision to implement daily dashboards for the fraud model and quarterly fairness monitoring for the credit model was essential. Validation caught the bias issue once; ongoing monitoring is what would catch it again if it re-emerged as data and applicant populations drift.

Independent Validation Essential for High-Risk: For high-risk models (especially those with regulatory impact), third-party validation added credibility. It gave regulators confidence that findings weren't biased toward approval.

Bias Can Be Hidden: The disparate impact in the credit model (using zip code as a proxy for race) wasn't malicious; it was a natural result of how the model was built. Evaluation methodology that specifically looks for these issues (demographic segment analysis, SHAP explainability) is essential. The bank's decision to retrain without zip code and accept minor accuracy loss for fair lending compliance was the right choice.

Operational Guardrails Matter: The fraud model evaluation revealed that raw model performance metrics don't tell the full story. Operational context (cost of false positives, investigation resources) drives what threshold to use. Bank can't just deploy model at whatever threshold optimizes AUC; they need to consider operational reality.

Governance is as Important as Methodology: Having a Model Risk Governance Committee, documented approval processes, and clear escalation paths convinced regulators the bank takes AI seriously. Evaluation without governance feels like a one-off exercise; evaluation with governance is a program.

Key Takeaway

Financial services AI governance isn't primarily about technical sophistication—it's about documenting that you understand your models, validate them rigorously, monitor them continuously, and respond to issues. MidCity Bank's success came from systematic evaluation, transparent documentation, and commitment to ongoing improvement, not from having perfect models.