Written by Susan Miller

Explaining Errors and Insights: Error Analysis Narrative for Clinical NLP Papers with SHAP/IG Phrasing

Struggling to turn model mistakes into reviewer-ready insights without overclaiming causality? In this lesson, you’ll learn to frame a disciplined error analysis for clinical NLP, build a stakes-aware evaluation grid, and report SHAP/IG attributions with cautious, defensible phrasing that drives actionable fixes. You’ll see clear explanations, AMIA/ACL-aligned examples, and targeted exercises (MCQs, fill‑in‑the‑blanks, error correction) to lock in structure, language, and placement across Methods, Results, and Discussion. Expect calibrated, publishable wording you can drop into Overleaf/Word with confidence and auditability.

1) Frame and scope: Clarify the narrative’s purpose, placement, and claims

An error analysis narrative in a clinical NLP paper is not a loose collection of observations. It is a deliberate, methodical explanation of where and why the model fails, what those failures mean for clinical use, and how insights from these failures inform methodological adjustments. Its purpose is twofold: first, to make the model’s behavior legible to readers who care about performance in context; second, to support credible claims about safety, generalizability, and utility. In clinical NLP, these claims must be conservative and well-bounded, because error costs vary by clinical scenario. Your narrative should therefore explicitly define which claims the study is making and which claims it is not making.

Placement in the paper structure matters. In AMIA and ACL venues, readers expect the primary description of the error analysis design to appear in Methods, with detailed results and interpretations in Results, and cautious extrapolation in Discussion. The Methods should identify the error taxonomy, dataset strata, and evaluation criteria that will be used. The Results should report quantitative patterns and qualitative interpretations. The Discussion should connect the error patterns to clinical stakes and propose practical improvements, avoiding overreach. This separation preserves clarity: Methods state what you will measure; Results show what actually happened; Discussion explains why that matters and what to do next.

Scope your narrative by defining the model’s intended use and the decision thresholds relevant to that use. A sepsis-detection classifier has a very different tolerance for false negatives than a model that extracts medication names for population health research. State the unit of analysis (note, encounter, patient), the temporal window, and the domain-shift conditions (e.g., hospital site, note type, or time period). Declare whether your analysis focuses on retrospective accuracy, triage support, cohort identification, or knowledge extraction, because each purpose directs error attention differently. For instance, a triage tool emphasizes recall on high-stakes positives; a cohort curation tool may emphasize precision to avoid contamination.

Set the narrative’s claims explicitly. A strong error analysis narrative promises transparency (we show which errors occur, where, and why), responsibility (we quantify uncertainty and discuss clinical stakes), and actionability (we propose grounded changes such as threshold adjustments, recalibration, or domain adaptation). Avoid implying that feature attribution methods like SHAP or Integrated Gradients (IG) reveal causal mechanisms. Instead, describe them as tools for understanding model reliance patterns that may correlate with, but do not prove, clinically meaningful reasoning.

Finally, define the evidentiary standard you will use. AMIA and ACL reviewers value reproducibility and structured justification. Make your analysis pipeline auditable: specify how you sample error cases, how many cases you review per stratum, how you reconcile annotator disagreement, and how you aggregate signal from SHAP/IG across cases. Your framing should make it easy for reviewers to see that your narrative is not anecdotal but systematic.
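
To make this concrete, a minimal sketch of a seeded, per-stratum draw of false positives and false negatives for qualitative review is shown below. It assumes a pandas DataFrame of predictions with hypothetical columns stratum, y_true, and y_pred; the column names and the review quota are illustrative, not prescribed.

```python
import pandas as pd

def sample_cases_for_review(df: pd.DataFrame, per_cell: int = 25, seed: int = 13) -> pd.DataFrame:
    """Seeded sample of false positives and false negatives per stratum for chart review.

    Assumes binary labels in hypothetical columns 'y_true' and 'y_pred', plus a 'stratum' column.
    """
    errors = df[df["y_true"] != df["y_pred"]].copy()
    errors["error_type"] = errors["y_pred"].map({1: "FP", 0: "FN"})
    # A fixed seed per (stratum, error type) cell keeps the review set auditable and reproducible.
    return (
        errors.groupby(["stratum", "error_type"], group_keys=False)
        .apply(lambda g: g.sample(n=min(per_cell, len(g)), random_state=seed))
    )
```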

2) Build the error taxonomy and evaluation grid: strata, error types, clinical stakes

The core of a publication-quality error analysis is an evaluation grid that connects model performance to a structured taxonomy of error types and context strata, all weighted by clinical stakes. Building this grid requires three components:

  • A clear error taxonomy that distinguishes label-related, data-related, and model-related failures.
  • A set of dataset strata that reflect real clinical heterogeneity and likely domain shift.
  • A mapping from error types and strata to clinical stakes that clarifies risk.

Start with the error taxonomy. Distinguish errors that arise from ground-truth ambiguity (e.g., weak labels, subjective criteria in notes), errors due to data quality (e.g., OCR noise, templated text artifacts), and errors due to model generalization (e.g., over-reliance on spurious lexical cues). Within each category, define subtypes, for example (a machine-readable sketch of this vocabulary follows the list):

  • Label/annotation errors: guideline ambiguity, inter-annotator disagreement, temporality mismatch (present vs historical), conditional/hypothetical statements misinterpreted as factual.
  • Data quality errors: note section misclassification, negation scope corruption, abbreviation ambiguity, domain-specific jargon drift across sites.
  • Model behavior errors: threshold-related misclassification near the decision boundary, calibration misalignment in minority subgroups, reliance on proxies that are clinically irrelevant or ethically sensitive.
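
To keep reviewer tags consistent, the subtypes above can be frozen into a machine-readable vocabulary. A minimal sketch follows; the category and subtype identifiers are adaptations of the taxonomy above, and renaming them for your project is expected.

```python
# Error taxonomy as a fixed vocabulary; reviewers tag each case with (category, subtype).
ERROR_TAXONOMY = {
    "label": ["guideline_ambiguity", "inter_annotator_disagreement",
              "temporality_mismatch", "hypothetical_as_factual"],
    "data": ["section_misclassification", "negation_scope_corruption",
             "abbreviation_ambiguity", "jargon_drift_across_sites"],
    "model": ["near_threshold_misclassification", "subgroup_calibration_misalignment",
              "spurious_proxy_reliance"],
}

def validate_tag(category: str, subtype: str) -> bool:
    """Reject tags outside the agreed vocabulary so annotations stay comparable across reviewers."""
    return subtype in ERROR_TAXONOMY.get(category, [])
```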

Next, construct meaningful strata. Strata are partitions of the dataset that you believe could change model behavior or error costs. Common clinical strata include note type (discharge summaries, ED notes), patient demographics that are ethically permissible to analyze and relevant to fairness assessment, care setting (inpatient vs outpatient), time period (pre- vs post-policy change), site or health system, language variety, and phenotype prevalence bands. For extraction tasks, also consider linguistic strata such as negation presence, temporality markers, and section type. For classification tasks, include disease prevalence segments and comorbidity burden. Each stratum should be justified: explain why this dimension is likely to change error rates or stakes.

Then, make clinical stakes explicit. Not all false positives or false negatives are equal. Define stake tiers (e.g., critical, moderate, low) based on downstream consequences. A false negative in a high-acuity phenotype used for triage may be critical; a false positive in a research cohort may be moderate; a boundary error in an entity offset may be low unless it impacts medication dosing fields. Tie each stake tier to a practical risk description, such as missed intervention, unnecessary consult, or noisy dataset that affects effect estimates. By mapping strata to stakes, you make your narrative clinically legible: readers can see not just that errors occur, but which ones matter most.

With taxonomy, strata, and stakes defined, construct the evaluation grid. Rows represent strata; columns represent error types; cells record quantitative metrics (e.g., error counts, error rates, calibration metrics like ECE or Brier score, threshold-specific precision/recall) and qualitative notes summarizing observed patterns. Keep the grid reproducible by specifying sampling rules: for each stratum, analyze a fixed number of false positives and false negatives, and a matched set of true positives and true negatives for context. Document agreement procedures if multiple reviewers conduct qualitative analysis. This grid becomes the backbone of your Results: you will walk readers across the high-stake cells first, then synthesize cross-stratum patterns.
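
As a concrete illustration, the sketch below pivots reviewed cases into the strata-by-error-type grid of counts and rates. Every column name, stratum, denominator, and stake label here is a hypothetical placeholder for your own review data.

```python
import pandas as pd

# Hypothetical reviewed cases: one row per error, tagged with stratum and a taxonomy code.
reviewed = pd.DataFrame({
    "stratum": ["ED_note", "ED_note", "discharge", "discharge", "ED_note"],
    "error_type": ["FN_near_threshold", "FP_template_artifact",
                   "FN_negation_scope", "FN_near_threshold", "FN_negation_scope"],
})

# Rows = strata, columns = error types, cells = counts.
grid_counts = pd.crosstab(reviewed["stratum"], reviewed["error_type"])

# Error rates need per-stratum denominators (number of evaluated instances in each stratum).
denominators = pd.Series({"ED_note": 400, "discharge": 250})
grid_rates = grid_counts.div(denominators, axis=0)

# A stake tier per error type lets Results surface the high-stake cells first.
stake_tier = {"FN_near_threshold": "critical",
              "FN_negation_scope": "critical",
              "FP_template_artifact": "moderate"}
```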

Finally, integrate uncertainty. For each metric, present confidence intervals and, where appropriate, Bayesian credible intervals or bootstrap intervals. For qualitative insights, indicate the sample size and selection method, and avoid overgeneralization. When you report subgroup metrics, ensure adequate sample size to avoid unstable estimates. If a stratum is small but clinically important, call it out with cautionary language and propose targeted future data collection or evaluation.
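
For the interval reporting above, a minimal percentile-bootstrap sketch for one per-stratum metric (recall, purely as an illustration) could look like this; the function and variable names are ours, not from any particular library.

```python
import numpy as np

def bootstrap_recall_ci(y_true, y_pred, n_boot: int = 2000, alpha: float = 0.05, seed: int = 7):
    """Percentile bootstrap confidence interval for recall within a single stratum."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    recalls = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample cases with replacement
        t, p = y_true[idx], y_pred[idx]
        if t.sum() == 0:                            # this replicate drew no positives; skip it
            continue
        recalls.append(((t == 1) & (p == 1)).sum() / t.sum())
    lower, upper = np.percentile(recalls, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)
```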

3) Report explainability with SHAP/IG: phrasing patterns, caveats, and synthesis into insights and actions

SHAP and Integrated Gradients can strengthen an error analysis narrative when they are used to describe model reliance patterns rather than to assert causal reasoning. Your goal is to show which input features or text spans the model weighted for particular predictions, and to align that with clinical context and the error taxonomy. To do this responsibly, use disciplined phrasing, consistent caveats, and a synthesis that feeds back into actionable changes.

Adopt phrasing that emphasizes contribution, locality, and uncertainty. For SHAP, describe feature attributions as contributions to the model’s output relative to a baseline expectation. For IG, describe attributions as gradients accumulated along a straight-line path from a baseline input to the actual input, scaled by the input difference (the definition is reproduced after the examples below). Use language like:

  • “SHAP indicates that tokens in the Assessment section contributed positively to the positive class for sepsis in this note.”
  • “Integrated Gradients highlights the phrase pattern around negation as contributing negatively to the pneumonia prediction.”
  • “Aggregated attributions over the ED-note stratum suggest greater reliance on vital-sign tokens than on symptom descriptors.”
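
For reference, this is the standard Integrated Gradients definition that the phrasing above paraphrases, where F is the model output, x the input, and x' the baseline:

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1}
  \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i} \, d\alpha
```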

Avoid causal claims. Do not say, “The model used X because X causes Y.” Instead, say, “The model’s output increased when features resembling X were present,” or “Attribution mass was concentrated on X-type tokens in these errors.” When uncertainty is high—due to multi-collinearity, token overlap, or subword segmentation—state it explicitly. Note limitations, such as instability across random seeds or baselines, and the risk that attributions reflect spurious shortcuts.

Structure the reporting in three layers:

  • Local explanations for representative errors within high-stake strata. Describe how attributions align or misalign with clinically relevant cues. Keep the focus on error types: for instance, threshold-near false negatives where attributions focus on conflicting cues (negation versus symptom severity terms) can signal calibration issues.
  • Aggregate explanations across strata. Summarize which features dominate attributions for true positives versus false positives and how this pattern shifts across note types, sites, or time periods. Tie these to prevalence and linguistic differences. Ensure that aggregation respects tokenization differences and that you use consistent baselines for IG.
  • Cross-method consistency checks. Compare SHAP and IG patterns to assess robustness. If both methods converge on similar reliance patterns, confidence increases; if they diverge, present both and discuss methodological reasons (e.g., SHAP’s model-agnostic approximations vs IG’s gradient dependence) and practical implications for interpretation (a minimal consistency-check sketch follows this list).
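
A minimal sketch of the aggregation and the cross-method check, assuming per-token attribution scores from each method have already been computed and aligned to the same token positions; that alignment step is the hard part in practice (subword tokenization differs across methods and models) and is not shown here.

```python
import numpy as np
from scipy.stats import spearmanr

def cross_method_consistency(shap_scores: np.ndarray, ig_scores: np.ndarray) -> float:
    """Spearman rank correlation between |SHAP| and |IG| attributions for one document.

    Both arrays must already be aligned to the same token positions.
    """
    rho, _ = spearmanr(np.abs(shap_scores), np.abs(ig_scores))
    return float(rho)

def mean_abs_attribution_by_stratum(per_doc_attrs: dict) -> dict:
    """Average absolute attribution mass per stratum, e.g., {'ED_note': [array, ...], ...}."""
    return {stratum: float(np.mean([np.abs(a).mean() for a in docs]))
            for stratum, docs in per_doc_attrs.items()}
```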

Always link attributions to the error taxonomy. For data quality errors, show how attributions concentrate on noisy sections (e.g., template headers), suggesting a need for section filtering. For label ambiguity, demonstrate that attributions highlight cues consistent with either interpretation, supporting guideline refinement. For model-related errors, use attributions to identify proxy reliance (e.g., billing codes in text) that could create fairness concerns or poor portability.

Translate explanations into actionable improvements with explicit mechanisms:

  • Threshold tuning: If high-stake false negatives cluster near the decision boundary and attributions show balanced competing evidence, propose threshold adjustments for those strata, with accompanying decision-curve or utility analyses. State how you would set different operating points by stratum when clinically justified and ethically permissible.
  • Calibration: If attributions indicate reliance on unstable cues and calibration metrics are poor (e.g., high ECE) in certain strata, recommend recalibration (e.g., temperature scaling, isotonic regression) using held-out data matched to that stratum. Explain how improved calibration changes risk communication in clinical settings (a minimal ECE computation is sketched after this list).
  • Domain adaptation: When aggregate attributions shift across sites or note types, propose domain-adaptive pretraining, adversarial alignment, or feature-level normalization. Clarify that the goal is to reduce reliance on site-specific proxies and stabilize attributions on clinically core features.
  • Data curation and labeling: If attributions expose frequent attention to irrelevant templates or section headers, recommend section-aware preprocessing. If label ambiguity drives errors, propose annotation guideline revisions and adjudication protocols targeted to those ambiguous constructs.
  • Model architecture or constraints: If attributions show repeated over-weighting of negation artifacts, suggest adding negation-aware components or constraints, or training with counterfactual augmentations to reduce shortcut learning.
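
For the calibration action above, here is a minimal positive-class expected calibration error (ECE) sketch over equal-width probability bins, computed separately per stratum. Ten bins is a conventional choice rather than a requirement, and this is one common ECE variant, not the only one.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """Equal-width-bin ECE for the positive class.

    Weighted mean of |observed positive rate - mean predicted probability| across bins.
    """
    y_true, y_prob = np.asarray(y_true, dtype=float), np.asarray(y_prob, dtype=float)
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        confidence = y_prob[in_bin].mean()      # mean predicted probability in this bin
        accuracy = y_true[in_bin].mean()        # observed positive rate in this bin
        ece += in_bin.mean() * abs(accuracy - confidence)
    return ece
```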

Report these actions in the Results as evidence-driven proposals, then in the Discussion as plans with practical feasibility and risks. Make explicit links to reviewer expectations at AMIA/ACL: transparency about methodology, attention to uncertainty, and demonstration that interpretability methods inform concrete steps rather than stand-alone visuals.

Maintain attribution hygiene. For SHAP, specify background distributions and sampling settings. For IG, document baselines, number of steps, and input normalization. Provide stability checks across seeds and subsamples. Include sensitivity analyses: how do attributions change with tokenization variants or section masking? Report when patterns are stable versus when they are not, and qualify interpretations accordingly.
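
As one way to make these settings auditable, here is a minimal, hypothetical sketch using Captum's IntegratedGradients with an explicit baseline, step count, and convergence delta, plus a second baseline as a stability check. The toy model and random embeddings are placeholders; a real clinical text model would typically attribute over its embedding layer (e.g., with LayerIntegratedGradients), and the SHAP side would analogously record its background dataset and sampling settings.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Toy stand-in for a classifier over precomputed token embeddings: (batch, tokens, dim) -> logit.
class ToyClassifier(nn.Module):
    def __init__(self, dim: int = 8):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.linear(emb.mean(dim=1)).squeeze(-1)   # mean-pool tokens, return one logit

torch.manual_seed(0)
model = ToyClassifier()
embeddings = torch.randn(1, 16, 8)                        # 1 note, 16 tokens, 8-dim embeddings

ig = IntegratedGradients(model)
zero_baseline = torch.zeros_like(embeddings)              # document the baseline choice in Methods
attributions, delta = ig.attribute(
    embeddings,
    baselines=zero_baseline,
    n_steps=64,                                           # document the step count as well
    return_convergence_delta=True,                        # report delta as an approximation check
)

# Stability check: a different (mean-embedding) baseline; compare attribution rankings across runs.
mean_baseline = embeddings.mean(dim=1, keepdim=True).expand_as(embeddings).contiguous()
alt_attributions = ig.attribute(embeddings, baselines=mean_baseline, n_steps=64)
```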

Finally, synthesize insights into a concise set of claims that respect clinical context:

  • Which errors are most costly under real clinical workflows, and where do they occur? Connect these to strata and stakes.
  • What do attributions imply about model reliance, and are those reliance patterns clinically aligned or spurious? Be precise about evidence strength.
  • Which modifications are justified next (thresholds, calibration, domain adaptation, data/process changes), and how will you evaluate their impact in a follow-up study?

By adhering to this structure—purpose and placement, a rigorous taxonomy and evaluation grid, and disciplined SHAP/IG phrasing that feeds into actionable improvements—you produce a narrative that meets peer-review expectations. You demonstrate that error analysis in clinical NLP is not a decorative appendix but a central scientific argument about model behavior, safety, and utility. This approach turns interpretability outputs into reliable, clinically meaningful insights and a roadmap for iterative model improvement that respects both methodological rigor and patient impact.

  • Frame the narrative with clear purpose, placement, and bounded claims: specify intended use, units, strata, thresholds, and report Methods (design), Results (findings), Discussion (stakes/actions).
  • Build a rigorous evaluation grid linking a structured error taxonomy (label, data, model) to meaningful dataset strata and explicitly tiered clinical stakes, with reproducible sampling and uncertainty reporting.
  • Use SHAP/IG to describe contribution-based reliance patterns (not causality), report local and aggregate attributions, check cross-method consistency, and disclose baselines, settings, and stability.
  • Translate findings into actions—threshold tuning, recalibration, domain adaptation, data/labeling fixes, and model constraints—prioritizing high-stakes errors and documenting how changes will be evaluated.

Example Sentences

  • Our Methods section specifies the error taxonomy, dataset strata, and evaluation criteria, while the Results report quantitative patterns with confidence intervals.
  • We frame the narrative to make conservative, well-bounded claims about safety, generalizability, and utility, avoiding any implication that SHAP or IG reveal causality.
  • Aggregated attributions over the ED-note stratum suggest greater reliance on vital-sign tokens than on symptom descriptors, with higher stakes for false negatives.
  • SHAP indicates that tokens in the Assessment section contributed positively to the sepsis prediction, but attribution instability across baselines is noted.
  • We map strata to clinical stakes so that threshold tuning prioritizes high-acuity false negatives and calibration addresses subgroup misalignment.

Example Dialogue

Alex: I’m finalizing the error analysis—where should the SHAP plots go?

Ben: Put the design details in Methods and the attribution findings in Results, then use Discussion to tie them to clinical stakes.

Alex: Got it. For ED notes, IG highlights negation phrases reducing pneumonia scores, and false negatives cluster near the threshold.

Ben: Then propose stratum-specific threshold adjustments and document calibration with ECE; just avoid causal language about the attributions.

Alex: I’ll also flag small strata with wide intervals and recommend section-aware preprocessing.

Ben: Perfect—keep claims conservative and make the pipeline auditable with sampling rules and adjudication notes.

Exercises

Multiple Choice

1. In an AMIA/ACL paper, where should the detailed attribution findings from SHAP/IG primarily be reported?

  • Methods
  • Results
  • Discussion
Show Answer & Explanation

Correct Answer: Results

Explanation: Methods explains the design (taxonomy, strata, evaluation plan); Results presents what actually happened, including quantitative patterns and attribution findings; Discussion interprets implications and next steps.

2. Which phrasing best aligns with the lesson’s guidance on explainability claims?

  • “The model used vital signs because they cause sepsis.”
  • “Attribution mass concentrated on vital-sign tokens in ED notes, suggesting increased model reliance.”
  • “IG proves the clinical mechanism behind sepsis detection.”
Show Answer & Explanation

Correct Answer: “Attribution mass concentrated on vital-sign tokens in ED notes, suggesting increased model reliance.”

Explanation: The lesson warns against causal claims. Appropriate phrasing emphasizes contribution and reliance patterns without asserting causality.

Fill in the Blanks

The narrative must make conservative, well-bounded claims and avoid implying that SHAP or IG reveal ___.

Show Answer & Explanation

Correct Answer: causality

Explanation: The guidance explicitly states that SHAP/IG should not be framed as proving causal mechanisms.

Rows represent strata and columns represent error types in the evaluation ___, with cells holding metrics and qualitative notes.

Show Answer & Explanation

Correct Answer: grid

Explanation: The core artifact is an evaluation grid linking strata, error types, and stakes with quantitative and qualitative evidence.

Error Correction

Incorrect: We placed our attribution settings and sampling rules in the Results, while the Discussion lists the error taxonomy.

Show Correction & Explanation

Correct Sentence: We placed our attribution settings and sampling rules in Methods, reported attribution findings in Results, and reserved Discussion for stakes and actions.

Explanation: Methods should document design and pipeline details; Results present findings; Discussion ties findings to clinical stakes and proposed actions.

Incorrect: SHAP proves that negation phrases cause lower pneumonia risk in patients across sites.

Show Correction & Explanation

Correct Sentence: SHAP indicates that negation-related tokens contributed negatively to the pneumonia prediction, but this does not imply causality across sites.

Explanation: Use contribution-based language and include caveats; SHAP/IG show reliance patterns, not causal effects.