Image by Somkid Thongdee from Getty Images via Canva.
Meghan Abate '25 is a neuroscience major and economics minor, and a 2024-25 health care ethics intern at the Markkula Center for Applied Ethics at Santa Clara University. Views are her own.
There is no question that artificial intelligence (AI) is making its way into our daily lives. In the health care setting, many physicians have come to rely on clinical-support algorithms: tools that analyze a patient’s status against collections of prior data in order to predict risks, diagnose diseases, and ultimately support physicians in their decision-making. Some of these algorithms may prove to work well; however, many academics worry that they can lead to poor health outcomes for certain minorities, especially the Black population in the United States.
For example, the Maternal Fetal Medicine Units (MFMU) Network’s Vaginal Birth after Cesarean (VBAC) calculator requires the input of patient information, such as age, weight, obstetric history, and, formerly, race, to predict the likelihood that a patient who previously underwent a cesarean delivery can deliver a child vaginally. In 2020, the VBAC calculator drew attention for systematically predicting lower success rates for Black and Hispanic women, thereby steering them toward repeat cesarean sections. These results are alarming because cesarean sections carry a higher risk of postpartum complications than a well-executed VBAC, and women of color already experience higher rates of postpartum complications and mortality than white women. Due to concerns that the VBAC calculator would worsen health outcomes for women of color, the MFMU Network removed race from the algorithm in 2021.
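To make the mechanism concrete, here is a minimal sketch in Python of how a risk calculator of this kind might combine patient inputs into a predicted probability of success. Every coefficient below, including the race adjustment, is invented purely for illustration; this is not the MFMU Network’s actual published model.

```python
import math

def predicted_vbac_success(age, weight_kg, prior_vaginal_birth, race_black_or_hispanic):
    """Hypothetical VBAC-style calculator: a logistic model over patient inputs.

    All coefficients are invented for illustration and are NOT those of the
    MFMU Network's actual calculator.
    """
    log_odds = (
        3.0                               # intercept
        - 0.05 * age                      # older age lowers predicted success
        - 0.01 * weight_kg                # higher weight lowers predicted success
        + 0.9 * prior_vaginal_birth       # a prior vaginal delivery raises it
        - 0.7 * race_black_or_hispanic    # the race term removed in the 2021 revision
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# Identical clinical inputs, differing only in the race indicator:
print(predicted_vbac_success(30, 70, 1, 0))  # higher predicted success
print(predicted_vbac_success(30, 70, 1, 1))  # systematically lower predicted success
```

With everything else held fixed, the race indicator alone pushes the predicted success rate downward, which is the pattern described above.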
The VBAC calculator is only one of many examples in which the inclusion of race in clinical-support algorithms has reinforced racial health disparities. In general, many scholars argue that these algorithms should not include race because of how algorithms are trained to look for patterns. When developers build algorithms, they often use “training data,” that is, past data sets, to train the algorithm to associate certain health outcomes with certain types of patients. For example, the algorithm may learn that people with a history of hypertension are more likely to experience a cardiac event than those with normal blood pressure. However, this training process gets tricky when the algorithm accounts for race, because historical data may reflect race-based inequalities.
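As a rough sketch of what that training step looks like in practice, consider the following Python snippet, which uses entirely made-up records and the hypertension example above; any race-correlated pattern present in historical data would be learned in exactly the same way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented historical records: [age, has_hypertension] -> had_cardiac_event (1 = yes).
X = np.array([[55, 1], [62, 1], [48, 0], [39, 0], [70, 1], [45, 0], [66, 1], [52, 0]])
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])

model = LogisticRegression().fit(X, y)

# The fitted model now associates hypertension with higher cardiac risk...
print(model.predict_proba([[60, 1]])[0, 1])  # 60-year-old with hypertension
print(model.predict_proba([[60, 0]])[0, 1])  # otherwise-identical patient without it
# ...and it would just as readily absorb any race-linked disparity baked into the data.
```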
Algorithms tend to overestimate the health status of minorities, either because they rely on false beliefs or simply because the data reflect the health care system’s failure to provide accessible, high-quality patient care to racial minorities (particularly in obstetrics). This argument against the inclusion of race presents valid and pressing concerns. However, I argue that the inclusion of race in clinical-support algorithms is only a small fraction of a much larger issue.
Inadvertent Racial Discrimination
Even without the inclusion of race, it is evident that algorithms may still systematically recommend the undertreatment of minorities, particularly Black Americans. One 2019 study investigated the possibility of biased training data associated with a particular large-scale, decision-support algorithm. Many companies, such as UnitedHealth Group, rely on this algorithm to flag patients who are likely to require “high-risk care management programs.” Flagged patients may be at high risk for various conditions, such as diabetes. The study’s authors report that even though patients who self-identified as Black “have 26.3% more chronic illnesses” and therefore carry “significantly more illness burden” than patients who self-identified as white, the algorithm in question scored Black patients similarly to white patients. As a result, Black patients were less likely to be enrolled in this care management program than white patients. The study does not name the algorithm, and it is unclear whether it has been modified since.
As it turns out, the algorithm was using health care costs as a proxy for health care needs. Since Black patients tend to have less access to health care than white patients due to barriers of structural racism, they incur fewer health care costs on average. When the algorithm uses this metric to measure health care needs, it inadvertently assumes that Black patients have fewer health care needs and therefore require less medical attention. When physicians use this kind of risk-calculator tool, they may therefore be advised to undertreat Black patients.
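A simplified illustration of this proxy problem, using entirely synthetic numbers rather than the study’s data: two groups that are equally sick generate different costs, so a program that flags patients by cost flags them at very different rates.

```python
import numpy as np

# Entirely synthetic illustration of using health care cost as a proxy for need.
rng = np.random.default_rng(0)
n = 5000

illness_burden = rng.poisson(3.0, n)     # true number of active chronic conditions
group = rng.integers(0, 2, n)            # group 1 faces barriers to accessing care
# For the same illness burden, group 1 incurs ~30% lower costs because barriers
# to care mean fewer visits, prescriptions, and procedures actually happen.
cost = 2000 * illness_burden * np.where(group == 1, 0.7, 1.0) + rng.normal(0, 800, n)

# The (hypothetical) program flags the costliest 10% of patients, i.e. it treats
# cost as if it were need.
flagged = cost >= np.quantile(cost, 0.90)

for g in (0, 1):
    mask = group == g
    print(f"group {g}: mean illness burden {illness_burden[mask].mean():.2f}, "
          f"flagged for extra care {flagged[mask].mean():.1%}")
# Both groups are equally sick on average, yet group 1 is flagged far less often.
```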
Notably, this algorithm excludes race, yet it still suggests the undertreatment of Black patients. The takeaway here is that even when algorithms do not evaluate race as an intended predictor of the output, they may inadvertently end up recommending outcomes that could increase racial disparities because of the way they are trained.
Other algorithms might inadvertently end up performing worse for certain racial minorities because they are trained on data that lacks diversity. One example of this is skin-cancer detection apps. The imaging of suspicious-looking skin tags, moles, and spots is one method by which physicians might flag malignant skin conditions. Developers have created smartphone apps whereby a user can upload a photograph of their skin and receive advice or a diagnosis based on the algorithm’s interpretation of that image. Thus, the algorithm must be trained on pre-existing images that show what malignant skin conditions typically look like compared to benign skin spots. The unfortunate reality is that there are far fewer photographs of dermatologic conditions on darker skin tones than of the same conditions on light skin. One 2024 study that searched for malignant skin conditions on Google Images found that 95% of the 1,200 images evaluated showed the dermatologic condition on light skin, while the remaining images showed darker skin tones, both dark and light skin tones, or were inconclusive. When algorithms that are meant to detect skin cancers rely on this kind of training data, they work incredibly well on light skin but are far less accurate on darker skin tones.
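To make the consequence of that imbalance concrete, here is a small, entirely hypothetical evaluation sketch: a classifier can post a high overall accuracy while performing much worse on the underrepresented group, which is why performance should be reported per skin tone rather than only in aggregate.

```python
import numpy as np

# Hypothetical evaluation of a skin-lesion classifier; the labels and predictions
# below are synthetic stand-ins, not real model output.
rng = np.random.default_rng(1)
n_light, n_dark = 1140, 60           # roughly the 95% / 5% imbalance described above

skin_tone = np.array(["light"] * n_light + ["dark"] * n_dark)
truth = rng.integers(0, 2, n_light + n_dark)          # 1 = malignant, 0 = benign

# Assume the model is right 92% of the time on light skin but only 65% on dark skin.
correct_prob = np.where(skin_tone == "light", 0.92, 0.65)
is_correct = rng.random(truth.size) < correct_prob
pred = np.where(is_correct, truth, 1 - truth)

print("overall accuracy:", (pred == truth).mean())     # looks reassuringly high
for tone in ("light", "dark"):
    mask = skin_tone == tone
    print(f"{tone} skin accuracy:", (pred == truth)[mask].mean())
```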
To reduce these disparities, researchers should aim to increase the diversity of subjects in data collection. This need applies to dermatologic imaging as well as to all other kinds of clinical research. In this new age of artificial intelligence, in which algorithms rely on previous research to draw present conclusions, diversity in scientific data has never been more important.
Developers of AI should also aim to be more transparent about the limitations of their training data. Often it is not advertised that these dermatologic imaging apps do not work well on dark skin tones. People who are unaware of these limitations may not get accurate results, thereby reinforcing the barriers racial minorities already face in receiving good quality health care.
The Principle of Justice
As author and bioethicist Lewis Vaughn puts it, the bioethical principle of justice requires that “people should be treated the same unless there is a morally relevant reason for treating them differently–and racial difference is not morally relevant.” In other words, “equals should be treated equally.” Unfortunately, the United States has a long history of treating non-white people as less than equal to white individuals. Systemic racial prejudice is so deeply rooted in our society that it can infiltrate these algorithms in many ways. Whether minority groups have access to health care, and whether they receive patient care that is at least as good as what white individuals receive, are critical nuances that severely limit the power of prior data to predict future outcomes. When algorithms are trained on data or proxies that reflect these disparities, the resulting clinical-support tools can widen race-based inequality in health care on an exceptionally large scale.
In a perfect world, these clinical support algorithms would work equally well for everyone. If this were the case, the benefits and potential of algorithms in the health care space would be undeniable. Clinical errors and physician fatigue could be reduced, diagnostics and pathology could be improved, and patient care could become more organized.
In the meantime, users, developers, and the general public should become educated about the unresolved issues and ethical implications of these algorithms. Currently, many algorithms are proprietary, meaning their inner workings can be considered intellectual property. As a result, users and researchers are not able to evaluate how an algorithm was trained and what its shortcomings may be. For this reason, scholars refer to these systems as “black boxes.” While protecting intellectual property serves some ethical purpose, it should not excuse developers from the obligation to acknowledge the limitations of their designs when those limitations are known. Where the limitations are not foreseeable, physicians should implement these algorithms sparingly and with the patient’s informed consent; they should still review each patient’s case and continue to draw their own conclusions where applicable.
As improving the diversity of empirical scientific research may take some time, I encourage users of these algorithms to proceed with caution. It is my hope that with these necessary improvements, algorithmic AI will become an accurate and ethical tool in the health care space.