A new study from Northwestern University has identified a concerning disparity in the accuracy of skin disease diagnosis across skin tones, one that persists even with the aid of a deep learning system (DLS).
In the study, published in Nature Medicine, the researchers found that, before assistance from a DLS, physicians were four percentage points less accurate at diagnosing skin conditions on darker skin tones than on lighter ones. While AI-powered decision support significantly improved diagnostic accuracy for all physicians, boosting it by 33% for dermatology specialists and 69% for primary care providers, it also appeared to exacerbate existing disparities.
The study included board-certified dermatologists (n=389) and primary care physicians (n=459) from 39 countries, and it evaluated the accuracy of diagnoses the physicians submitted in a store-and-forward teledermatology simulation. Physicians were presented with 364 images spanning 46 skin diseases and asked to submit up to four differential diagnoses. DLS decision support improved the diagnostic accuracy of both specialists and generalists by more than 33%, but it exacerbated the gap in generalists' diagnostic accuracy across skin tones. According to the authors, these results demonstrate that well-designed physician-machine partnerships can enhance physicians' diagnostic accuracy, but they also illustrate that improving overall diagnostic accuracy does not necessarily address bias.
“We suspected bias, but specialists don’t have this AI-exacerbated bias, whereas primary care physicians do,” said lead author Matthew Groh, assistant professor at Northwestern’s Kellogg School of Management, in a press release. “When a specialist sees advice from AI, they take their own vast knowledge into account when diagnosing. Whereas primary care physicians might not have that same deep intuition of pattern matching, so they go with the AI suggestion on patterns that they are aware of.”
Specifically, AI assistance widened the accuracy gap between light and dark skin tones by five percentage points for primary care physicians. “Our study reveals that there are disparities in accuracy of physicians on light versus dark skin,” said Groh. “And in this case, it’s not the AI that is biased, it’s how physicians use it.”
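To make the arithmetic behind these figures concrete, here is a minimal sketch, not the authors' code, of how a case might be scored and how a light-versus-dark accuracy gap could be computed in a simulation like this. The function names and accuracy values are illustrative assumptions, chosen only so the output lines up with the gaps described above (a roughly four-point gap before DLS support that widens by about five points for generalists).

```python
def case_correct(differentials: list[str], reference: str, k: int = 4) -> bool:
    """A case counts as correct if the reference diagnosis appears among the
    physician's top-k submitted differential diagnoses (here, up to four)."""
    return reference in differentials[:k]

# Illustrative group-level accuracies (fraction of cases scored correct),
# keyed by (physician type, DLS support, skin tone). Not values from the study.
accuracy = {
    ("generalist", False, "light"): 0.36,
    ("generalist", False, "dark"):  0.32,
    ("generalist", True,  "light"): 0.62,
    ("generalist", True,  "dark"):  0.53,
}

def gap_pp(acc: dict, physician_type: str, dls_support: bool) -> float:
    """Light-minus-dark accuracy gap, in percentage points."""
    return 100 * (acc[(physician_type, dls_support, "light")]
                  - acc[(physician_type, dls_support, "dark")])

before = gap_pp(accuracy, "generalist", False)  # ~4 pp gap without DLS support
after = gap_pp(accuracy, "generalist", True)    # ~9 pp gap with DLS support
print(f"Gap widened by {after - before:.0f} percentage points")
```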
The study suggests that specialists, with their deeper clinical knowledge, are better able to contextualize the AI's recommendations. Primary care providers, on the other hand, may be over-relying on the AI's output, especially for skin conditions they may have less experience diagnosing.
These findings underscore the need to address systemic biases in medical education and practice, where representation of darker skin tones remains limited. Groh hopes this research will spur further exploration of how human-AI partnerships can be optimized to serve all patients equitably.
"We have to find a way to incorporate underrepresented demographics in our research," Groh said. "That way we will be ready to accurately implement these models in the real world and build AI systems that serve as tools designed to avoid the kind of systematic errors we know humans and machines are prone to."