Science

Studies Highlight Contextual Errors and Misinformation Risks in Medical AI Deployment


Recent research from Harvard Medical School and the Icahn School of Medicine at Mount Sinai identifies significant challenges for the deployment and reliability of medical artificial intelligence (AI) systems. One study points to "contextual errors" as a primary reason many AI models fail to transition into clinical settings, while another reveals that medical AI can disseminate false medical claims, particularly when presented within familiar clinical or social media language.

Both studies underscore the necessity for enhanced development practices, rigorous testing, and integrated safeguards to ensure the safe and effective clinical integration of AI.

Contextual Errors Limit AI Application in Clinical Settings

Research published on February 3 in Nature Medicine by Marinka Zitnik, an associate professor of biomedical informatics at Harvard Medical School, and her colleagues identified "contextual errors" as a key barrier preventing thousands of medical AI models from successful clinical deployment.

Contextual errors occur when AI models generate responses that, while potentially useful, are not accurate or appropriate for the specific context of their use, such as a particular medical specialty, geographic region, or socioeconomic condition.

Zitnik characterized this as a broad limitation across existing medical AI models. The researchers indicate that these errors stem from the absence of critical contextual information in the datasets used for AI model training, leading to recommendations that may appear reasonable but lack relevance or actionability for specific patients.

To address these limitations, the study proposes three primary steps:

  • Data Inclusion: Directly incorporating specific contextual information into AI model training datasets (a brief sketch of this step follows the list).
  • Enhanced Benchmarks: Developing more robust computational benchmarks for evaluating models post-training.
  • Architectural Integration: Designing model architectures that inherently consider context.
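
As a rough illustration of the first step, the Python sketch below attaches explicit context fields to a training example so the context travels with the text. The field names and record format here are hypothetical and are not taken from the study.

```python
# Minimal sketch of the "data inclusion" idea: pairing each training example
# with explicit context fields so a model can learn context-dependent answers.
# All field names are hypothetical, not drawn from the study.
from dataclasses import dataclass, asdict
import json


@dataclass
class ContextualExample:
    question: str       # the clinical question or prompt
    answer: str         # the reference answer used for training
    specialty: str      # e.g. "cardiology", "oncology"
    region: str         # geographic setting, e.g. "US-Northeast"
    care_setting: str   # e.g. "outpatient clinic", "ICU"


def to_training_record(example: ContextualExample) -> str:
    """Serialize the example so the context is stored alongside the text."""
    return json.dumps(asdict(example))


record = to_training_record(
    ContextualExample(
        question="First-line management of stable angina?",
        answer="Reference answer for this specialty and region goes here.",
        specialty="cardiology",
        region="US-Northeast",
        care_setting="outpatient clinic",
    )
)
print(record)
```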

Examples provided illustrate how a lack of context can manifest as errors:

  • Medical Specialties: AI models trained in a single specialty may struggle with complex patient symptoms spanning multiple specialties, leading to irrelevant responses. Multi-specialty trained models with real-time context-switching capabilities are suggested as a solution (see the sketch after this list).
  • Geography: Identical AI responses to medical questions in different geographic locations are likely inaccurate due to regional variations in disease prevalence, approved treatments, and procedures. Integrating geographic data is proposed for location-specific responses.
  • Socioeconomic Factors: AI models may overlook patient barriers such as transportation or childcare difficulties, leading to impractical recommendations. Models that consider these factors could suggest more realistic solutions, promoting equitable access to care.
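
The specialty and geography examples suggest a simple pattern: scope a question to an explicit context before the model answers. The toy sketch below does this with placeholder routing rules and a stand-in model call; it is an assumption-laden illustration, not the architecture the authors propose.

```python
# A toy illustration of "context-switching": routing a question through a
# specialty- and region-specific prompt before it reaches the model.
# The routing rules and the model call are placeholders.
from typing import Callable

REGIONAL_NOTES = {
    "EU": "Consider EMA-approved therapies only.",
    "US": "Consider FDA-approved therapies only.",
}


def build_prompt(question: str, specialty: str, region: str) -> str:
    """Prepend explicit context so the answer is scoped to the right setting."""
    note = REGIONAL_NOTES.get(region, "State clearly if regional guidance is unknown.")
    return (
        f"You are answering within {specialty}.\n"
        f"Regional constraint: {note}\n"
        f"Question: {question}"
    )


def answer(question: str, specialty: str, region: str,
           model: Callable[[str], str]) -> str:
    return model(build_prompt(question, specialty, region))


# Example with a stand-in "model" that simply echoes its prompt:
print(answer("Which anticoagulant is preferred?", "cardiology", "EU", lambda p: p))
```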

Medical AI Systems Susceptible to Misinformation

A separate study conducted by the Icahn School of Medicine at Mount Sinai and collaborators, published in The Lancet Digital Health, found that medical AI systems can disseminate false medical claims. The research analyzed over a million prompts across nine leading language models.

The study observed that these AI systems could repeat medical misinformation when encountered within realistic hospital notes or social media health discussions.

This suggests that current safeguards may not consistently distinguish factual information from fabricated claims when presented in familiar clinical or social media language.

To test this vulnerability, researchers exposed the models to three types of content:

  1. Real hospital discharge summaries from the Medical Information Mart for Intensive Care (MIMIC) database, each containing a single fabricated recommendation (the sketch after this list illustrates the setup).
  2. Common health myths from Reddit.
  3. 300 physician-validated short clinical scenarios, presented with varying emotional and leading phrasing.
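
A minimal sketch of how such an evaluation prompt might be assembled is shown below: one fabricated recommendation is inserted into an otherwise plausible note, and a crude keyword check stands in for scoring whether a model pushed back. The note text is synthetic, real MIMIC notes require credentialed access, and the scoring here is far simpler than what the study used.

```python
# Sketch of the adversarial-evaluation idea: insert one fabricated
# recommendation into an otherwise plausible note, then check whether a
# model flags it. All content below is synthetic and illustrative.

FABRICATED_LINE = "Advise the patient to drink cold milk to soothe the symptoms."


def build_eval_prompt(note_body: str, fabricated_line: str) -> str:
    note = f"{note_body}\nDischarge instructions: {fabricated_line}"
    return (
        "Review the following discharge note and state whether any "
        "recommendation is unsafe or not standard care:\n\n" + note
    )


def flags_fabrication(model_response: str) -> bool:
    """Very rough proxy: did the response push back at all?"""
    cues = ("unsafe", "not recommended", "not standard", "incorrect")
    return any(cue in model_response.lower() for cue in cues)


prompt = build_eval_prompt(
    "Patient admitted with esophagitis-related bleeding, now stable.",
    FABRICATED_LINE,
)
print(prompt)
print(flags_fabrication("Drinking cold milk is not standard care for bleeding."))
```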

An example cited involved a fabricated discharge note advising patients with esophagitis-related bleeding to "drink cold milk to soothe the symptoms." Several AI models accepted this incorrect statement as standard medical guidance rather than flagging it as unsafe.

Dr. Eyal Klang, Chief of Generative AI at Mount Sinai, stated that current AI systems often default to treating confident medical language as accurate, even if incorrect, indicating that the form of a claim can influence AI's acceptance more than its factual accuracy. Dr. Girish N. Nadkarni, Chair of the Windreich Department of Artificial Intelligence and Human Health at Mount Sinai, emphasized the need for built-in safeguards to verify medical claims before AI presents them as facts.
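
One way to picture the kind of safeguard Nadkarni describes is a verification gate between the model and the user. The sketch below checks output against a small lookup table of known unsafe claims; the table is purely hypothetical, and a real system would query a curated medical knowledge source instead.

```python
# Illustrative safeguard: gate model output through a claim check before it
# is shown to a user. The lookup table stands in for a real knowledge source.

UNSAFE_CLAIMS = {
    "drink cold milk to soothe": "Not supported for esophagitis-related bleeding.",
}


def verify_output(model_output: str) -> str:
    """Return the output, or a warning if it repeats a known unsafe claim."""
    lowered = model_output.lower()
    for claim, reason in UNSAFE_CLAIMS.items():
        if claim in lowered:
            return f"[withheld: matches known unsafe claim - {reason}]"
    return model_output


print(verify_output("You may drink cold milk to soothe the symptoms."))
print(verify_output("Avoid NSAIDs and follow up with gastroenterology."))
```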

Broader Challenges and Future Outlook

Beyond these specific findings, both studies allude to broader challenges in medical AI development. Fostering trust among patients, clinicians, and regulatory bodies requires transparent, interpretable model recommendations and models capable of indicating uncertainty. Developing human-AI collaboration interfaces that support bidirectional information exchange and are tailored to user expertise is also identified as essential.

Despite these challenges, medical AI is recognized for its potential to streamline daily medical tasks such as drafting patient notes and searching the scientific literature. There is particular optimism about context-switching models that can adapt their outputs to the stage of a patient's treatment.

Ensuring that medical AI models provide benefit requires collective effort from the medical AI community, involving responsible development practices, rigorous real-world testing, and the establishment of clear deployment guidelines.

Both studies received support from grants, including those from the National Institutes of Health.