What a DCR Is and Why It Exists

A Deployment Clearance Report is the authoritative, frozen document that states whether a system is ready to go to production and under what conditions. Unlike an evaluation report (which is exploratory), a DCR is a gate document. It answers a single question: Should we deploy this?

The DCR exists to force an explicit, accountable decision: without it, systems drift into production by default, and there is no auditable record of who approved deployment, on what evidence, under what conditions.

A good DCR is a decision artifact, not a knowledge dump. It contains enough information to justify the decision but is ruthlessly edited to prevent noise.

The 6-Section Structure

1. Executive Summary (250-400 words)

This is read by 80% of readers. It must stand alone. A busy VP should be able to read only this and understand the go/no-go decision and the top 3 risks.

Sub-components:

  • The recommendation (APPROVED / APPROVED with CONDITIONS / NOT APPROVED)
  • System identification (name, version, git hash)
  • A one-paragraph evaluation summary with the headline metrics
  • Top 3 risks, each with its mitigation and residual risk
  • Any mandatory conditions on deployment

Write the Executive Summary last, after all sections are complete, to ensure accuracy.

2. Scope and Methodology (400-600 words)

This section prevents future arguments about whether the evaluation was relevant. Be painfully specific.

Sub-components:

  • Exactly what was evaluated (system name, version, git hash)
  • Evaluation set composition and size
  • Metrics used and the pre-specified pass/fail criteria
  • Who rated outputs, and the inter-rater agreement achieved
  • What was explicitly NOT evaluated

3. Findings (600-800 words)

This is the meat. Organize by capability, not by metric. A reader should understand what the system is good at and what it's bad at.

Sub-components:

  • Performance broken down by capability or query type
  • Failure-mode analysis with concrete examples and root causes
  • Latency and other operational metrics
  • Comparison against the baseline or prior system

4. Risk Assessment (400-600 words)

Risks are not areas where performance was mediocre. Risks are scenarios where real harm could happen.

Sub-components (for each risk):

  • Severity, likelihood, and detectability
  • Mitigations, and the residual risk after mitigation

Example risk structure:

Risk: Model hallucination in medical recommendations.
Severity: HIGH (patient safety).
Likelihood: 2.1% (observed in eval set, extrapolated to production).
Detectability: MEDIUM (requires human review of high-confidence medical recommendations; ~500 per day).
Mitigation: (1) Real-time alert for any medical claim + recommendation. (2) Second human review required before showing to user. (3) Monthly audit of alerts.
Residual risk after mitigation: LOW.

List 5-8 distinct risks. Not every concern is a risk; some are just "areas we don't have data on."
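The risk structure above maps naturally onto a simple record type. Here is a minimal sketch in Python; the field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Risk:
    """One entry in the Risk Assessment section of a DCR."""
    name: str
    severity: str                 # HIGH / MEDIUM / LOW
    likelihood: float             # observed or extrapolated rate, 0-1
    detectability: str            # how, and how quickly, it would be caught
    mitigations: list[str] = field(default_factory=list)
    residual_risk: str = "UNASSESSED"

# The example risk above, as a record:
hallucination = Risk(
    name="Model hallucination in medical recommendations",
    severity="HIGH",
    likelihood=0.021,
    detectability="MEDIUM",
    mitigations=[
        "Real-time alert for any medical claim + recommendation",
        "Second human review before showing to user",
        "Monthly audit of alerts",
    ],
    residual_risk="LOW",
)
```

Keeping risks in a structured form like this makes it easy to enforce that every risk has all six fields filled in before the DCR is frozen.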

5. Recommendation (100-200 words)

The shortest section. Three options:

  1. APPROVED: Deploy as-is. Performance meets all criteria. No additional monitoring beyond standard practice.
  2. APPROVED with CONDITIONS: Deploy, but with specific conditions (monitoring, guardrails, user restrictions). System doesn't deploy if conditions aren't met.
  3. NOT APPROVED: Do not deploy. Performance gaps must be closed first. Recommend specific fixes before re-evaluation.

Write the recommendation defensively. Assume it will be read in a lawsuit.

Strong language: "APPROVED for production deployment in US healthcare settings. Prerequisite: Real-time hallucination detection system must be operational before launch (target deployment: ). Automatic escalation to human review required for confidence > 0.9 on medical claims. System is NOT approved for unsupervised use."

Weak language: "We think it's probably ready to go. Some monitoring would be good. Hallucinations are a concern but maybe they'll be okay."

6. Appendix (variable length)

Everything that supports the claims in sections 1-5:

  • Full metric tables and per-segment breakdowns
  • Evaluation set composition and sampling methodology
  • Rater instructions and agreement statistics
  • Raw failure examples
  • Confidence intervals for headline numbers

DCR Writing Style Guide

Be specific. "High accuracy" is not acceptable. "94.2% accuracy on factual queries, 68% on synthesizing queries" is acceptable.

Quantify uncertainty. "Approximately 92% accurate" is weak. "92.1% ± 3.4% (95% CI)" is strong. This signals you understand statistical rigor.
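The arithmetic behind a claim like "92.1% ± 3.4% (95% CI)" is simple to show. A sketch using the normal approximation, with a hypothetical eval set of 242 queries:

```python
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    """Point estimate and 95% CI half-width for accuracy (normal approximation)."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, half_width

p, hw = accuracy_ci(223, 242)        # hypothetical counts
print(f"{p:.1%} ± {hw:.1%} (95% CI)")  # → 92.1% ± 3.4% (95% CI)
```

For small eval sets or accuracies near 0% or 100%, a Wilson interval is the better choice; the point is that the interval appears in the DCR at all.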

Use numbers as evidence, not justification. Not: "The accuracy is 92%, which is good." Better: "Accuracy is 92%, vs. 85% for baseline, vs. 95% for production requirement. This is 3 percentage points below requirement; Section 4 details how this risk is mitigated."

Avoid hedging language: Not "might," "may," "could," "probably." Use "will," "does," "observed."

Example hedging → strong. Hedged: "The system may occasionally hallucinate." Strong: "Hallucination rate: 1.8% (20/1,100 queries), concentrated in salary-related answers."

Own the decision. A DCR is not a hedge. Don't hide behind "more research needed." If you genuinely cannot decide, the recommendation is NOT APPROVED.

Writing the Go/No-Go Recommendation Defensibly

Your recommendation will be scrutinized. Write it as if you're testifying under oath.

Elements of a defensible APPROVED recommendation:

  1. Clear criteria that were pre-specified. "Deployment approved if: (a) accuracy ≥ 90% on all segments, (b) hallucination rate < 5%, (c) latency p95 < 3s, (d) inter-rater agreement κ ≥ 0.70. All criteria met."
  2. Acknowledgment of known unknowns. "We have high confidence in performance on factual queries. We have not evaluated performance on hypothetical/counterfactual queries. This is acceptable because ."
  3. Baseline comparison. If replacing an existing system: "Accuracy improved 4 percentage points. Hallucination rate decreased 60%. Latency increased 0.2s (acceptable tradeoff per stakeholder)."
  4. Specific monitoring and escalation. "Real-time monitoring of hallucination rate (daily dashboard, automated alert if > 4%). Monthly review of escalations. Automatic rollback if hallucination rate exceeds 5% for 2 consecutive days."
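A pre-specified gate like the one in item 1 can even be encoded mechanically, which removes any ambiguity about whether "all criteria met" holds. A hypothetical sketch, not a real deployment API:

```python
def gate_decision(metrics: dict) -> str:
    """Apply pre-specified go/no-go criteria; any single failure blocks approval."""
    checks = {
        "accuracy >= 90% on all segments": min(metrics["segment_accuracy"]) >= 0.90,
        "hallucination rate < 5%":         metrics["hallucination_rate"] < 0.05,
        "latency p95 < 3s":                metrics["latency_p95_s"] < 3.0,
        "inter-rater kappa >= 0.70":       metrics["kappa"] >= 0.70,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return "APPROVED" if not failed else "NOT APPROVED: " + "; ".join(failed)

decision = gate_decision({
    "segment_accuracy": [0.978, 0.93, 0.91],  # illustrative per-segment numbers
    "hallucination_rate": 0.018,
    "latency_p95_s": 1.4,
    "kappa": 0.78,
})
print(decision)  # → APPROVED
```

Writing the gate down as code before the evaluation runs is the strongest possible form of "criteria that were pre-specified."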

Elements of a defensible NOT APPROVED recommendation:

  1. Specific gaps. Not: "Accuracy is too low." Better: "Accuracy is 87%, below the 90% threshold. This system serves ~1.4 billion customer-facing queries per quarter; the 3-percentage-point gap between 87% and 90% means ~42 million quarterly exposures to errors."
  2. Path to approval. "To reach approval: retraining on 500 additional edge cases (estimated 2-week cycle) or reducing scope to FAQ-only use case (reduces error exposure to 8 million quarterly). Recommend retraining path."
  3. Re-evaluation criteria. "Will re-evaluate when: (a) retraining complete, (b) new eval set built from recent edge cases, (c) accuracy ≥ 90% on validation, (d) κ ≥ 0.75 on inter-rater agreement."
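Since the re-evaluation criteria above lean on inter-rater agreement, here is a minimal two-rater Cohen's kappa for reference (a sketch; production evals typically use a stats library):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a | freq_b) / (n * n)
    return (observed - expected) / (1 - expected)

# Two raters agree on 3 of 4 items:
cohens_kappa(["pass", "pass", "pass", "fail"],
             ["pass", "pass", "fail", "fail"])   # → 0.5
```

Raw percent agreement overstates reliability when labels are imbalanced; kappa corrects for agreement expected by chance, which is why the thresholds in DCRs are stated as κ values.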

How Different Audiences Read the DCR

The CTO (30 seconds): Reads Executive Summary only. Needs to know: go or no-go, and the one biggest risk.

The Product Manager (5 minutes): Reads Executive Summary + Findings. Needs to understand: what does it do well, what does it do badly, can we launch it.

The Engineer (45 minutes): Reads everything. Needs to understand: the exact evaluation methodology, every failure mode, what monitoring to set up, what to be ready to debug.

The Compliance Officer (60 minutes): Reads Scope/Methodology, Risk Assessment, and Appendix. Needs to verify: was evaluation rigorous, are risks documented, can we defend this decision in an audit.

The End User (they don't): End users never read the DCR. A public-facing summary (such as a model card) may exist, but the full technical DCR is internal.

Structure your DCR to serve all four audiences simultaneously. The Executive Summary lets the CTO skim. The appendix lets the engineer deep-dive. Section 4 (Risk Assessment) is where the Compliance Officer lives.

The Conditional Deployment Option

Many systems are neither clearly "GO" nor "NO-GO." They're "GO with guardrails." This is not a cop-out if structured properly.

Example: "APPROVED for production in healthcare settings, with the following mandatory conditions:

  1. Real-time hallucination detection system operational (Condition Status: ready for prod, deploy May 15).
  2. All outputs with confidence > 0.85 on medical claims require human review before showing to user (Condition Status: UX designer allocated, estimate June 1).
  3. Monitoring dashboard live with alerts if hallucination rate exceeds 4% on any rolling 24h window (Condition Status: data eng allocated, estimate May 20).
  4. Weekly review of flagged outputs for first 4 weeks, then monthly thereafter (Condition Status: ops team allocated).

System will NOT launch until all four conditions are met. Estimated launch date: June 5. If any condition is delayed >2 weeks, decision will be revisited."

This is concrete, not vague. It's deployable. It's defensible.
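The automatic-rollback trigger in this kind of plan (e.g., "hallucination rate exceeds 5% for 2 consecutive days") is simple enough to encode directly. A hypothetical sketch:

```python
def rollback_required(daily_rates, threshold=0.05, consecutive_days=2):
    """True once the hallucination rate exceeds threshold for N consecutive days."""
    streak = 0
    for rate in daily_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive_days:
            return True
    return False

print(rollback_required([0.031, 0.062, 0.044, 0.058, 0.071]))  # → True
print(rollback_required([0.031, 0.062, 0.044, 0.058, 0.049]))  # → False
```

A condition that can be evaluated by a function like this is verifiable; "monitoring should be in place" is not.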

Sample DCR Excerpt (RAG System)

Scenario: Evaluating a RAG system that answers questions about a company's internal policies for 5,000 employees.

EXECUTIVE SUMMARY

RECOMMENDATION: APPROVED for production deployment, with mandatory real-time monitoring.

System: PolicyBot v2.1 (Git hash: a7f3e2b), candidate for replacing PolicyBot v1.0 (currently 300 queries/day).

Evaluation Summary: PolicyBot v2.1 correctly answers 94.2% of questions (accuracy on ground truth), vs. 88.3% for v1.0. Hallucination rate: 1.8% (20/1100 queries generated facts not in policy documents), vs. 4.2% for v1.0. Latency: p95 1.4 seconds (acceptable for async deployment model). Human evaluation (3 raters, κ=0.78) confirms accuracy on policy interpretation questions.

Top 3 Risks:
1. Hallucination on edge cases (salary policies): 8/20 hallucinations involve salary amounts. Likelihood: 0.7% of production queries. Mitigation: Real-time alert for salary-related answers; human review before showing to employee. Residual risk: LOW.
2. Performance on complex multi-step procedures: Success rate 71% on procedures requiring > 3 steps. Likelihood: 5% of queries. Mitigation: UI guidance to decompose complex questions; escalation to HR for complex cases. Residual risk: MEDIUM (acceptable given UI mitigation).
3. Handling of policy updates: System knowledge cutoff is Jan 1, 2024. Likelihood of out-of-date answer: 2% (recent policy changes). Mitigation: Monthly retraining on new policies; quarterly evaluation refresh. Residual risk: MEDIUM (expected for RAG systems).

Conditions: (1) Real-time hallucination detection system must be live before production. (2) Weekly monitoring of hallucination rate for first month. (3) Automatic rollback if hallucination rate exceeds 5%.

---

FINDINGS

Accuracy by Question Type:
- Factual (benefits, eligibility): 97.8% (156/159) 
- Multi-step procedures: 71.2% (41/57)
- Interpretation (does policy X apply to situation Y): 89.3% (67/75)
- Salary/compensation: 93.1% (27/29)
Overall: 94.2% (291/309)

Hallucination Analysis (20 instances):
- Salary amounts hallucinated: 8 instances
  Example: "Q: What is the bonus structure for engineers? A: 10-15% base salary (INCORRECT: policy specifies 5-8%)"
  Root cause: Retriever returned salary band document; LLM extrapolated to bonus structure.
- Policy dates hallucinated: 7 instances
  Example: "Q: When did remote work policy start? A: Started in 2019 (policy actually started 2020)."
  Root cause: Model training data included older company announcements.
- Procedure details hallucinated: 5 instances
  Example: "Q: Who approves PTO over 30 days? A: Department head and Finance (actually just Department head)."
  Root cause: LLM combining two related policies (PTO approval and budget approval).

Latency: p50=0.8s, p95=1.4s, p99=2.1s. All queries completed within 3s SLA.

Comparison to v1.0: Accuracy +5.9pp, hallucination -2.4pp, latency -0.2s. v1.0 failure modes (context cutoff, inability to handle multi-turn) now resolved.

---

RISK ASSESSMENT

Risk 1: Salary-related hallucinations (8 of 20 observed hallucinations).
Severity: HIGH (employee makes decisions based on incorrect compensation info).
Likelihood: 0.7% of production traffic (~2 queries/day).
Detectability: HIGH (real-time check if salary-related terms appear in query + answer; human review required).
Mitigation: (1) Automated salary query detector + escalation to HR. (2) If confidence < 0.80 on salary queries, refuse answer. (3) Weekly audit of flagged queries.
Residual Risk: LOW.

Risk 2: Complex procedure failures (16/57 multi-step procedures failed).
Severity: MEDIUM (employee follows incomplete procedure, wastes time, may need to restart).
Likelihood: 5% of production queries are multi-step (29% failure rate on 5% of queries ≈ 1.45% of all traffic affected).
Detectability: MEDIUM (user feedback + monthly metric review).
Mitigation: (1) UI recommends breaking complex questions. (2) System refuses answers for procedures with > 4 steps. (3) Escalates to HR.
Residual Risk: MEDIUM (acceptable; user experience improvement justifies residual risk).

Risk 3: Policy updates lag (quarterly retraining).
Severity: MEDIUM (employee gets outdated policy).
Likelihood: 2% (estimate based on policy change velocity).
Detectability: MEDIUM (detected in monthly eval refresh; user reports).
Mitigation: (1) Monthly update of high-velocity policies (salary, benefits deadlines). (2) Quarterly full retraining. (3) Change log visible to users ("Last updated: [date]").
Residual Risk: MEDIUM (expected for any RAG system; business accepts this tradeoff).

Common DCR Mistakes

Mistake 1: Dumping data instead of making a decision. A DCR with 200 metrics and no clear recommendation is a bad DCR. Ruthlessly prioritize. Show the 4-5 metrics that actually matter for the decision.

Mistake 2: Hedging the recommendation. "We think it's probably ready, but more testing might be good" is not a recommendation. It's a non-decision that will cause the project to stall.

Mistake 3: Making conditions that are impossible to verify. Bad: "System should perform well across all user types." Good: "Accuracy ≥ 90% for users with tenure < 1 month, which we will sample weekly from production."

Mistake 4: Not explaining failure modes. If there are failures, explain why. Weak: "2% of queries failed." Better: "2% of queries failed; 60% were out-of-scope prompts that should have been filtered before reaching the model; 40% were legitimate edge cases related to time-sensitive policies."

Mistake 5: No baseline comparison. "Accuracy is 94%" means nothing without context. Compared to what? To v1.0 (88%)? To the production requirement (90%)? To human performance (96%)? Give all three.

Mistake 6: Ignoring statistical rigor. If your eval set is small (n < 100), your confidence intervals are wide. Say so. If you didn't check inter-rater agreement, your accuracy numbers are questionable. Don't hide it.

DCR vs. Model Card vs. System Card

| Document | Purpose | Audience | Tone | Length | Decision Type |
|---|---|---|---|---|---|
| DCR | Gate-keeping: is this ready to deploy? | Internal (exec, engineer, compliance) | Defensive, precise, decision-focused | 5-8 pages | Go/No-Go/Conditional |
| Model Card | Model transparency: what is this model, what does it do? | External (researchers, users, regulators) | Transparent, comprehensive, educational | 4-6 pages | Informational (no gate) |
| System Card | System transparency: how do all components work together? | External + internal | Educational, system-level view | 8-12 pages | Informational (no gate) |

A DCR is action-forcing; a Model Card is informational. You could publish your Model Card without publishing the DCR (for privacy or liability reasons). You cannot deploy without a DCR.

Key Takeaways

  • A DCR is a gate document: it authorizes or blocks deployment. It is not optional.
  • The 6-section structure (Executive Summary, Scope, Findings, Risk Assessment, Recommendation, Appendix) ensures all stakeholders can extract what they need.
  • Write defensibly. Assume your DCR will be scrutinized in a legal discovery process.
  • Quantify everything. Avoid hedging. Specify conditions concretely, not vaguely.
  • The recommendation (GO/NO-GO/CONDITIONAL) must be unambiguous. If you genuinely can't decide, the answer is NO-GO.