Two doctors examine MRI brain scans. Photo by Vitaly Gariev on Unsplash.
Evan Hackstadt is a computer science major with minors in biology and math. He is a 2025-26 health care ethics intern at the Markkula Center for Applied Ethics at Santa Clara University. Views are his own.
Artificial Intelligence (AI) is increasingly being used across health care. Complex machine learning models such as deep neural networks have achieved impressive accuracy at clinical support tasks, from disease detection to personalized treatment planning. However, the black box nature of these models raises a number of concerns – namely, is it ethical to base care on a model whose decisions cannot be explained? Explainability techniques have been deployed in response to this issue, but they have flaws of their own that threaten the duties of clinicians and the rights of patients. What level of interpretability should we require for clinical AI? What techniques and regulations are needed to protect patients while still advancing care?
The Problem With Black Box Models in Health Care
Deep neural networks use many parameters and layers of nodes to learn complex patterns in data. They can be highly accurate, but are known to come with a number of risks: encoding systemic bias, using problematic shortcuts to make predictions, and struggling to generalize to real clinical settings. Additionally, these models are “black box” in nature: there is no way to know how the model arrived at a prediction, because its structure is too complex.
Black box clinical AI violates a patient's right to autonomy, a key principle of clinical ethics. For example, if a clinician uses a black box model to justify treatment, they will be unable to explain how the model arrived at its prediction. Thus, fully informed consent cannot be obtained from the patient, violating autonomy and eroding trust.
Furthermore, bias can be difficult to detect in black box models. If a clinical support model is biased, it violates the ethical principle of justice.
The consequences of error-prone black box models in health care have already been seen in various cases, such as the proprietary Epic Sepsis model. A 2021 investigation found that the model (at that time) performed worse than the company claimed, created alert fatigue through false positives, and took shortcuts by predicting sepsis based only on signs that the clinician already suspected sepsis (e.g. ordering a diagnostic test). Because the model was a black box, its predictions could not be explained, allowing the circular logic it had learned to go undetected.
Explainability: A Band-Aid for Black Box Models
In response to the black box nature of deep neural networks, various “explainability” methods have been developed in an attempt to capture their decision-making process (e.g. SHAP, LIME, GradCAM). Most of these are post-hoc tests that attempt to explain model outputs after they have been generated, usually by estimating the importance of different inputs to the output. However, there are three key concerns with using explainability on individual predictions.
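To make the idea of post-hoc, input-importance estimation concrete, here is a minimal sketch in the spirit of LIME-style local surrogates – a deliberately simplified toy, not the actual SHAP/LIME/GradCAM algorithms. The `black_box` risk score and the patient features are entirely hypothetical.

```python
import random

def black_box(patient):
    # Stand-in for an opaque model: a hypothetical, deliberately
    # nonlinear risk score (NOT a real clinical model).
    score = 0.02 * patient["age"] + 0.1 * patient["lactate"] ** 2
    return 1.0 if score > 1.5 else 0.0

def local_sensitivity(model, patient, n_samples=500, scale=0.5):
    # Crude local-surrogate explanation: perturb one feature at a
    # time and average (change in output) * (perturbation) to
    # estimate each feature's local influence on this prediction.
    baseline = model(patient)
    importances = {}
    for feature in patient:
        total = 0.0
        for _ in range(n_samples):
            perturbed = dict(patient)
            shift = random.uniform(-scale, scale)
            perturbed[feature] = patient[feature] + shift
            total += (model(perturbed) - baseline) * shift
        importances[feature] = total / n_samples
    return importances

random.seed(0)
patient = {"age": 60, "lactate": 1.8}
print(local_sensitivity(black_box, patient))
```

Note that the result is only an approximation built from random perturbations around one input: change the perturbation scale, the sample count, or the seed, and the "explanation" can shift – a small illustration of the fidelity and stability concerns discussed below.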
First, post-hoc explainability techniques are unreliable. These methods merely estimate or approximate the model’s decision-making process; they can never be completely faithful to the original model, otherwise they would equal the original model. A 2025 meta-analysis found explainability techniques in medical imaging to have low fidelity scores and inconsistency under noisy inputs. Likewise, heatmaps and saliency maps have been found to be unstable when imperceptible manipulations are applied to inputs. Saliency maps often provide visualizations that are completely independent of the model.
The second key issue with explainability is that it has no bearing on the correctness of the prediction. Even if an explanation is faithful, the model’s output could simply be wrong in the first place. Worse, explanations are often unfaithful or uninterpretable themselves, which merely stacks an additional source of error on top of an incorrect prediction.
This leads into the third key concern: the explainability trap. Closely related to automation bias, this is the phenomenon where adding an explanation to a model’s output makes the user more likely to trust that output even when they don’t understand the explanation – despite explanations having no bearing on accuracy. Furthermore, even if a given explanation is faithful, the clinician must still make a normative judgment about what the explanation means and whether the prediction is trustworthy.
Explainability is Ethically Insufficient
Despite sounding like a convenient solution, explainability techniques are insufficient to salvage the ethical principles that black box AI models violate in a health care context.
Since current explainability methods are often unreliable at explaining individual predictions, this creates a risk of deception. A deceptive explanation would violate patient autonomy (informed consent and truth-telling) and nonmaleficence.
The fact that explainability methods have no bearing on correctness becomes ethically problematic when the underlying model has poor accuracy, is taking shortcuts, and/or is biased. Explainability methods may fail to expose these model flaws and thus fail to uphold the principle of justice.
The explainability trap heightens this issue: given a flawed model, adding post-hoc explainability will merely make its outputs appear more convincing – likely perpetuating systemic bias and dangerous predictions in high-risk scenarios. The explainability trap directly violates nonmaleficence, since disguising a wrong prediction would be actively doing harm to patients.
A classic case that illustrates these concerns is IBM’s Watson for Oncology. The model was supposed to be a treatment recommendation system for cancer patients, but contained a number of foundational flaws that often led to unhelpful, false, or even dangerous recommendations. For example, given a patient whose cancer had not spread to the lymph nodes, the model recommended a chemotherapy drug used only for cancer with lymph involvement. And to support its recommendation, Watson cited a study demonstrating the efficacy of this drug. In this specific example, the failure was obvious. But the ethical danger is clear: post-hoc explanations make false outputs more trustworthy. This would be even more dangerous with mathematical explainability techniques, which are less interpretable to clinicians. Watson failed in both interpretability and validation.
Interpretability: A Promising Alternative
A different response to the black box problem of deep learning is to design inherently interpretable models that are not black boxes in the first place. While post-hoc explainability tries to put a bandage over a black box model, “ante-hoc” interpretability builds a model whose structure (i.e. decision-making process) can be intuitively understood. A basic example of an inherently interpretable model is a decision tree (essentially a flowchart) or a rule-based system.
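A toy sketch of what “inherently interpretable” means in practice: a rule-based classifier that returns not only its prediction but the exact rules that fired. The triage labels, feature names, and thresholds here are illustrative assumptions, not clinical guidance.

```python
def interpretable_triage(patient):
    # Hypothetical rule-based triage sketch (illustrative thresholds,
    # NOT clinical guidance). Returns a label plus the exact rules
    # that fired, so the full decision path can be read directly.
    path = []
    if patient["systolic_bp"] < 90:
        path.append("systolic_bp < 90 -> hypotension branch")
        if patient["lactate"] > 2.0:
            path.append("lactate > 2.0 -> high risk")
            return "high risk", path
        path.append("lactate <= 2.0 -> moderate risk")
        return "moderate risk", path
    path.append("systolic_bp >= 90 -> low risk")
    return "low risk", path

label, path = interpretable_triage({"systolic_bp": 85, "lactate": 3.1})
print(label, "|", " -> ".join(path))
```

Unlike a post-hoc approximation, the decision path here *is* the model: there is nothing further to estimate, so the explanation shown to the clinician and patient is faithful by construction.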
It is commonly assumed that interpretable models must perform worse than black box deep learning models. Recent research, however, has shown this “accuracy-interpretability tradeoff” to be fairly small in clinical settings, or has rejected it entirely.
More importantly, interpretable models offer a number of ethical benefits. Since it is possible to see exactly how a model arrived at a given prediction, patients can receive model recommendations with fully informed consent. The risk of harm that comes with the explainability trap is mitigated. And interpretability will readily expose bias or shortcut learning that threaten the principle of justice. If an accuracy penalty does exist, it represents a small loss of beneficence in exchange for upholding the other three principles: nonmaleficence, autonomy, and justice.
Moving Forward: Combining Validation, Interpretability, and Explainability
When it comes to clinical AI models, the obsession with explainability must be left behind. Instead, regulations need to first prioritize rigorous validation, then inherent interpretability, and finally post-hoc explainability only for global model audits.
Rigorous validation of models, similar to phased clinical trials for drugs, has been proposed by multiple authors. In fact, some have even argued that rigorously-validated black box models are preferable to less-validated interpretable models. Regardless, external validation on diverse populations is of utmost importance to confirm the reliability and fairness of clinical AI models before they are deployed.
Inherently interpretable models should be prioritized over black box models, particularly when there is little-to-no difference in accuracy. When there is a gap in accuracy, both black box and interpretable models should be made available to hospitals and clinicians.
Finally, explainability can be used by researchers for global model audits to check for bias or shortcut learning, but never to explain individual predictions to patients.
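A global audit differs from a per-prediction explanation in that it examines aggregate behavior across a whole population. One minimal form such an audit could take – purely a sketch, with hypothetical field names (`group`, `features`, `label`) – is comparing error rates across demographic subgroups:

```python
from collections import defaultdict

def subgroup_error_audit(model, dataset):
    # Global audit sketch: compare error rates across demographic
    # subgroups to surface potential bias. Field names ("group",
    # "features", "label") are hypothetical placeholders.
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in dataset:
        g = record["group"]
        totals[g] += 1
        if model(record["features"]) != record["label"]:
            errors[g] += 1
    return {g: errors[g] / totals[g] for g in totals}
```

A large gap between subgroup error rates is a population-level signal that a model is biased – exactly the kind of finding a researcher can act on before deployment, without ever presenting a per-patient explanation as justification for an individual decision.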
Current EU and US FDA regulations on clinical AI are not promising. On January 6, 2026, the FDA exempted certain clinical decision support tools from regulation. This represents an ethical crisis as clinical AI continues to be integrated into the US health care system. Rather than relying on blind explanation, we must shift toward rigorous validation and interpretable architectures that uphold beneficence, nonmaleficence, autonomy, and justice for all patients.