In brief
Almost half of AI chatbot responses to health questions were rated "somewhat" or "highly" problematic in a BMJ Open audit of five major chatbots.
Grok produced significantly more "highly problematic" responses than statistically expected, while nutrition and athletic performance questions fared worst across all models.
No chatbot produced a fully accurate reference list.
Almost half of the health and medical answers provided by today's most popular AI chatbots are incorrect, misleading, or dangerously incomplete, and they're delivered with total confidence. That is the headline finding of a new peer-reviewed study published April 14 in BMJ Open.
Researchers from UCLA, the University of Alberta, and Wake Forest tested five chatbots (Gemini, DeepSeek, Meta AI, ChatGPT, and Grok) on 250 health questions covering cancer, vaccines, stem cells, nutrition, and athletic performance. The results: 49.6% of responses were problematic. Thirty percent were "somewhat problematic," and 19.6% were "highly problematic," the kind of answer that could plausibly steer someone toward ineffective or dangerous treatment.
To stress-test the models, the team used an adversarial approach, deliberately phrasing questions to push the chatbots toward bad advice. Questions included whether 5G causes cancer, which alternative therapies are better than chemotherapy, and how much raw milk to drink for health benefits.
"By default, chatbots don't access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences," the authors write. "They don't reason or weigh evidence, nor are they able to make ethical or value-based judgments."
That is the core problem. The chatbots aren't consulting a doctor; they're pattern-matching text. And pattern-matching on the internet, where misinformation spreads faster than corrections, produces exactly this kind of output.
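That mechanism can be sketched in a few lines. Below is a minimal, illustrative next-token loop using the small open-source gpt2 model via Hugging Face's transformers library; the model choice, the prompt, and greedy decoding are assumptions for demonstration, not how any of the audited chatbots is actually configured:

```python
# Minimal sketch of next-token prediction, the mechanism the authors describe.
# The model (gpt2) and greedy decoding are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Raw milk is healthy because"
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits        # scores for every possible next token
        next_id = logits[0, -1].argmax()  # pick the statistically likeliest one
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))  # a fluent continuation, with no fact-checking anywhere
```

Nothing in that loop consults evidence or weighs truth; it only extends the prompt with whatever tokens were statistically common in the training text.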
The researchers continue: "This behavioural limitation means that chatbots can reproduce authoritative-sounding but potentially flawed responses." Out of 250 questions, only two prompted a refusal to answer, both from Meta AI, on anabolic steroids and alternative cancer treatments. Every other chatbot kept talking.
Performance varied by topic. Vaccines and cancer fared best, partly because high-quality research on those subjects is well structured and widely reproduced online. Nutrition had the worst statistical performance of any category in the study, with athletic performance close behind. If you've been asking AI whether the carnivore diet is healthy, the answer you got was probably not grounded in scientific consensus.
Grok stood out for the wrong reasons. Elon Musk's chatbot was the worst performer of any model tested. Of its 50 responses, 29 (58%) were rated problematic overall, the highest share across all five chatbots. Fifteen of those (30%) were highly problematic, significantly more than would be expected under a random distribution. The researchers connect this directly to Grok's training data: X is a platform known for spreading health misinformation rapidly and widely.
Citations were a separate disaster. Across all models, the median completeness score for references was just 40%, and not one chatbot produced a fully accurate reference list. Models hallucinated authors, journals, and titles. DeepSeek even admitted it: the model told researchers its references were generated from training-data patterns "and may not correspond to actual, verifiable sources."
The readability problem compounds everything else. All chatbot responses scored in the "Difficult" range on the Flesch Reading Ease scale, equivalent to college sophomore-to-senior level. That exceeds the American Medical Association's recommendation that patient education materials should not go beyond a sixth-grade reading level.
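For reference, Flesch Reading Ease is a simple formula over sentence length and word length. Here is a rough Python sketch; the vowel-group syllable counter is a crude heuristic assumed for illustration, not the exact tooling the researchers used:

```python
# Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
# Scores of 30-50 fall in the "Difficult" (college-level) band the study reports;
# 90+ corresponds to very easy, roughly fifth-grade text.
import re

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))          # crude sentence count
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(                                              # vowel-group heuristic
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(round(flesch_reading_ease("The drug was safe. Most people felt fine."), 1))
```

Long sentences stuffed with polysyllabic jargon drag the score down into that "Difficult" band, which is exactly the pattern the study found in chatbot answers.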
In other words, these chatbots pull the same trick politicians and professional debaters tend to use: firing so many technical terms at you in so little time that you end up thinking they know more than they do. The harder something is to understand, the easier it is to misinterpret.
The findings echo a February 2026 Oxford study covered by Decrypt that found AI medical advice no better than traditional self-diagnosis methods. They also track with broader concerns about AI chatbots delivering inconsistent guidance depending on how questions are framed.
"As the use of AI chatbots continues to expand, our data highlight a need for public education, professional training, and regulatory oversight to ensure that generative AI supports, rather than erodes, public health," the authors conclude.
The study only tested five free-tier chatbots, and the adversarial prompting methodology may overstate real-world failure rates. But the authors are direct: the problem isn't the edge cases. It's that these models are deployed at scale, used by non-experts as search engines, and configured, by design, to almost never say "I don't know."