
Dr. AI says you have sleep apnea: measuring the accuracy of patient-facing AI in sleep medicine
You type “Do I have sleep apnea?” into ChatGPT and get back a detailed, confident, doctor-sounding answer in seconds. It feels like getting a second opinion without waiting months for a sleep study. But is that answer actually correct?
A new study published in the Journal of Clinical Sleep Medicine puts that question to the test. Dr. Christine H.J. Won of Yale University School of Medicine and the VA Connecticut Healthcare System evaluated how accurately patient-facing artificial intelligence tools (the large language models powering chatbots like ChatGPT, Gemini, and Claude) answer real questions about sleep apnea screening, diagnosis, and treatment. The findings are a sobering check on the hype surrounding AI in medicine.
What they found
Dr. Won designed a systematic evaluation of AI-generated responses to common patient questions about obstructive sleep apnea (OSA). The questions covered the full spectrum of clinical concerns: risk factors, symptoms, when to seek a sleep study, interpretation of home sleep test results, treatment options including PAP therapy and oral appliances, and long-term management.
Using validated assessment tools (including the QAMAI framework, the Quality Analysis of Medical Artificial Intelligence tool developed by Vaira and colleagues in 2024), the study scored each response for accuracy, completeness, and safety. Would the AI steer a patient toward appropriate care? Would it flag dangerous misconceptions? Or would it confidently offer advice that sounds plausible but is medically wrong?
The study builds directly on prior work by Hack et al. (2026, also in JCSM), which compared generative AI against traditional web search for OSA patient education. That earlier study found that AI could match, and sometimes exceed, web search in the quality of information it provided. But it also found that AI responses carried a unique risk: they sound authoritative even when they are wrong, making it harder for patients to spot errors.
Dr. Won’s results sharpen this picture. While LLMs often produced responses that were broadly reasonable, accuracy varied considerably depending on the specific question asked. The tools performed best on general, well-documented topics like “What are the symptoms of sleep apnea?” or “How is sleep apnea treated?” These are topics where information is abundant in training data and relatively stable over time. The tools struggled most with questions that call for nuanced clinical judgment: interpreting borderline diagnostic results, recommending follow-up testing, and weighing treatment options for patients with multiple comorbidities.
Why it matters
The stakes here are not academic. Obstructive sleep apnea affects an estimated 936 million adults aged 30 to 69 worldwide, according to the most recent global prevalence data, and the vast majority remain undiagnosed. Patients are increasingly turning to AI as a first stop for health information, sometimes before ever seeing a physician. A 2025 survey cited in the study found that roughly one in five adults had used a generative AI tool for a health-related question, and that number is climbing.
For sleep medicine, which already struggles with long wait times for sleep studies and a shortage of board-certified sleep specialists, AI could be either a powerful triage tool or a source of dangerous misdirection. If patients act on incorrect AI-generated advice (skipping a medically indicated sleep study, adjusting their own PAP therapy settings, or dismissing warning signs of more serious conditions like central sleep apnea or hypoventilation), the consequences could be serious.
The problem is compounded by the fact that AI chatbots are not designed for medical decision-making. They are designed to produce plausible, fluent text. When a patient asks a question that has no clear answer or requires individualized clinical judgment, the model will produce something. And that something may be incomplete, misleading, or flatly wrong.
Limits of the study
The study, like all early work in this area, has important limitations. It evaluated a snapshot of AI tools at a single point in time; LLMs are updated frequently, and accuracy can shift dramatically between model versions. The assessment also relied on expert review of responses rather than real-world patient outcomes. We do not yet know how often patients actually modify their care based on AI advice, or what harms result when they do.
Additionally, the study focused specifically on obstructive sleep apnea. Other sleep disorders (insomnia, restless legs syndrome, narcolepsy, circadian rhythm disorders) present different challenges for AI, and the accuracy of LLM responses likely varies across these conditions.
The bottom line
This study does not suggest that AI has no role in sleep medicine. The Hack et al. companion paper and other recent systematic reviews (including those by Abd-Alrazaq et al. 2024 on wearable AI for sleep apnea detection, Banjade et al. 2025 on AI in sleep medicine broadly, and Haghighat et al. 2025 on AI diagnostic accuracy for OSA) all point to genuine promise. AI can help patients understand their condition, prepare for clinic visits, and navigate treatment options. Used appropriately, it may improve health literacy and access to care.
But accuracy is not guaranteed. The key takeaway from Dr. Won’s work is that patients and clinicians alike need to treat AI-generated health advice with the same skepticism they would apply to any other unverified source. A chatbot is not a doctor. It cannot review your medical history, interpret your symptoms in context, or tell you when it does not know the answer.
For now, the safest approach is simple: ask the chatbot for background information, but bring the actual medical decisions to a human sleep specialist. The AI may be a useful starting point, but it is not the finish line.
Source: Won CHJ. Dr. AI says you have sleep apnea: measuring the accuracy of patient-facing AI in sleep medicine. J Clin Sleep Med. 2026;22(1):99. DOI: 10.1007/s44470-026-00119-2. PMID: 42387082.

