Medical researchers identify chatbots' advice as hazardous or questionable in 5-13% of cases
In a recent study titled "Large language models provide unsafe answers to patient-posed medical questions," researchers highlight several critical safety issues with using large language models (LLMs) as medical advisors.
- Hallucinations and Inaccuracies: One of the main concerns is the generation of fabricated or incorrect medical information, known as "hallucinations." This can lead to potentially harmful misinformation, especially in high-risk settings like drug safety or clinical decision making.
- Decline in Disclaimers: Over time, LLMs increasingly omit safety disclaimers clarifying that AI output is not a substitute for professional medical advice. Without these disclaimers, users may place undue trust in inaccurate or unsafe AI-generated responses.
- Low Accuracy and Risk of Misclassification: In medical tasks such as medical coding or risk stratification, LLMs show low accuracy and a considerable false-positive rate, raising the risk of misinterpretation and harmful clinical decisions.
- Automation Bias: Clinicians and patients may over-rely on confident but incorrect AI-generated information without sufficient verification, increasing the likelihood of unsafe outcomes.
- Challenges in Governance and Monitoring: Traditional healthcare governance frameworks are inadequate for overseeing AI safety. There is a need for real-time monitoring, dynamic governance, and risk-based management systems to ensure safe deployment of LLMs in healthcare.
- Need for Guardrails: Deployments must include technical guardrails to detect anomalies, flag incorrect data, convey uncertainty, and integrate human oversight to mitigate errors and hallucinations.
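To make the guardrail recommendation above concrete, here is a minimal sketch of an output-side wrapper, assuming a hypothetical deployment where each answer arrives with a model confidence score. The `apply_guardrails` helper, the `HIGH_RISK_TERMS` keyword triage, and the 0.7 confidence threshold are illustrative assumptions, not part of the study or any specific product; a real system would use validated safety classifiers rather than keyword matching.

```python
"""Minimal guardrail sketch: wrap an LLM answer before it reaches a patient.

All names and thresholds here are illustrative assumptions for this example.
"""

from dataclasses import dataclass, field

# Crude triage list standing in for a validated safety classifier.
HIGH_RISK_TERMS = ("dosage", "overdose", "chest pain", "suicide", "stop taking")

DISCLAIMER = (
    "This response is AI-generated and is not a substitute for advice "
    "from a qualified healthcare professional."
)


@dataclass
class GuardedAnswer:
    text: str
    needs_human_review: bool = False
    flags: list = field(default_factory=list)


def apply_guardrails(question: str, answer: str, model_confidence: float) -> GuardedAnswer:
    """Attach a disclaimer, flag risky content, and escalate uncertain answers."""
    result = GuardedAnswer(text=f"{answer}\n\n{DISCLAIMER}")

    # Flag anomalies: keyword triage over the question and the answer.
    lowered = (question + " " + answer).lower()
    for term in HIGH_RISK_TERMS:
        if term in lowered:
            result.flags.append(f"high-risk term: {term}")

    # Convey uncertainty and route low-confidence or flagged answers to human oversight.
    if model_confidence < 0.7 or result.flags:
        result.needs_human_review = True
        result.text = "[Unverified - clinician review pending]\n" + result.text

    return result


if __name__ == "__main__":
    guarded = apply_guardrails(
        question="Can I double my dosage if I missed a pill?",
        answer="Yes, taking two pills at once is usually fine.",  # unsafe model output
        model_confidence=0.55,
    )
    print(guarded.needs_human_review, guarded.flags)
    print(guarded.text)
```

The design choice worth noting is that the guardrail sits outside the model: it annotates and escalates rather than trying to correct the answer, which keeps the human-oversight step explicit.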
The study found that even the most advanced AI chatbots, including ChatGPT and Google's Gemini, gave dangerously wrong answers when asked for medical advice. Meta's Llama, a freely available open-weight model, had the most issues in the tests, despite being heavily used in live professional contexts.
The authors concede that not every behavioral change in an LLM will improve a given use case, and they call for a standard, widely accepted 'live' benchmark for this task. Despite the potential of LLMs to improve human health, the study identifies several serious safety issues that must be addressed before they can be used safely and effectively in healthcare.
Responsible LLM use in medicine requires hybrid human-AI workflows, continuous monitoring, effective disclaimers, and robust safety guardrails to prevent patient harm.
- In light of the study, it's crucial to recognize that even advanced AI chatbots, such as ChatGPT and Google's Gemini, can provide misleading medical advice, posing real risks to patient health and wellness.
- As the use of large language models (LLMs) expands across medical research and health-and-wellness applications, it is essential to establish robust technical solutions so that these systems are governed, continuously monitored, and integrated with human oversight to minimize harm and maintain a high standard of safety; a brief monitoring sketch follows.
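As one illustration of what continuous monitoring could look like in practice, the sketch below tracks the rolling share of guardrail-flagged answers and raises an alert when it drifts above a policy threshold. The `SafetyMonitor` class, the window size, and the 5% alert rate are hypothetical choices for this example rather than recommendations from the study; real thresholds would come from clinical governance policy.

```python
"""Minimal sketch of continuous safety monitoring for a deployed medical chatbot.

Window size, alert rate, and the flag signal are illustrative assumptions.
"""

from collections import deque


class SafetyMonitor:
    """Track the rolling share of flagged answers and alert on drift."""

    def __init__(self, window: int = 200, alert_rate: float = 0.05):
        self.recent = deque(maxlen=window)   # 1 = flagged, 0 = clean
        self.alert_rate = alert_rate         # e.g. alert above a 5% flag rate

    def record(self, is_flagged: bool) -> None:
        self.recent.append(1 if is_flagged else 0)

    @property
    def flag_rate(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def should_alert(self) -> bool:
        # Require a minimally filled window before alerting to avoid noise.
        return len(self.recent) >= 50 and self.flag_rate > self.alert_rate


if __name__ == "__main__":
    monitor = SafetyMonitor()
    # Simulated stream: most answers clean, every 12th one flagged by guardrails.
    for i in range(300):
        monitor.record(is_flagged=(i % 12 == 0))
    print(f"rolling flag rate: {monitor.flag_rate:.1%}, alert: {monitor.should_alert()}")
```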