Sleep EEG foundation models reveal within-stage microstructure that improves health screening beyond traditional stages

For decades, sleep medicine has relied on a five-stage system to summarize an entire night of brain activity: wake, N1, N2, N3 (deep sleep), and REM. These stages are the language clinicians use to diagnose sleep disorders, assess sleep quality, and guide treatment. But a provocative new preprint suggests this framework may be leaving valuable information on the table.

This article covers a preprint. The research described below has not yet been peer-reviewed and should be interpreted with caution.

Coon and Ogg, in a study posted to Research Square on June 26, 2026, trained self-supervised transformer models on more than 11,000 overnight EEG recordings, without using human-labeled sleep stages, and asked whether the models could learn richer representations of sleep physiology than the traditional staging system captures. Their answer, across a battery of health outcome predictions, is a qualified yes.

What they found

The researchers used 11,261 overnight polysomnography recordings drawn from multiple clinical and population-based cohorts. They trained transformer neural networks using self-supervised learning (SSL), a technique in which a model learns meaningful patterns from unlabeled data by solving a pretext task: in this case, predicting masked segments of the EEG signal. The resulting “foundation model” representations were then probed for their ability to predict a range of outcomes, including body mass index (BMI), age, sex, the apnea-hypopnea index (AHI), and functional measures related to sleep and daytime performance.
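To make the masked-prediction idea concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the patch length, the 50% mask ratio, the toy sinusoidal "EEG", and the function names are all assumptions made for illustration, and a simple averaging rule stands in for the transformer. What it shows is the shape of the objective: hide some segments, predict them from the visible ones, and score the error on the hidden segments only.

```python
import math
import random

def make_patches(signal, patch_len):
    """Split a 1-D signal into non-overlapping patches, dropping any remainder."""
    n = len(signal) // patch_len
    return [signal[i * patch_len:(i + 1) * patch_len] for i in range(n)]

def choose_mask(n_patches, mask_ratio, rng):
    """Pick which patch indices are hidden from the predictor."""
    n_masked = max(1, int(round(n_patches * mask_ratio)))
    return sorted(rng.sample(range(n_patches), n_masked))

def visible_mean_predictor(patches, masked):
    """Stand-in for a transformer (assumption): predict every hidden patch
    as the element-wise mean of the visible patches."""
    hidden = set(masked)
    visible = [p for i, p in enumerate(patches) if i not in hidden]
    patch_len = len(patches[0])
    mean_patch = [sum(p[j] for p in visible) / len(visible) for j in range(patch_len)]
    return {i: mean_patch for i in masked}

def masked_mse(patches, predictions):
    """Mean squared error computed on the masked patches only."""
    total, count = 0.0, 0
    for i, pred in predictions.items():
        for target, guess in zip(patches[i], pred):
            total += (target - guess) ** 2
            count += 1
    return total / count

rng = random.Random(0)
# Toy signal (assumption): a 10 Hz sinusoid sampled at 100 Hz for 30 s, plus noise.
fs, seconds = 100, 30
signal = [math.sin(2 * math.pi * 10 * t / fs) + rng.gauss(0, 0.1)
          for t in range(fs * seconds)]

patches = make_patches(signal, patch_len=100)            # 1-second patches
masked = choose_mask(len(patches), mask_ratio=0.5, rng=rng)
predictions = visible_mean_predictor(patches, masked)
loss = masked_mse(patches, predictions)

# Reference point: a predictor that always outputs zeros.
zero_loss = masked_mse(patches, {i: [0.0] * 100 for i in masked})
print(f"masked MSE: {loss:.4f} (all-zeros predictor: {zero_loss:.4f})")
```

In a real SSL setup the averaging rule would be replaced by a trainable network whose weights are updated to lower this masked error, which is what pushes the model to learn structure in the signal without any stage labels.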

The experiments included careful architectural controls. The investigators compared the SSL-trained models against (1) transformers trained from random initialization directly on each downstream task (no pretraining), (2) transformers pretrained in a supervised fashion on the standard five sleep stages, and (3) traditional spectral summary features derived from the EEG.

The results were not uniform across outcomes. SSL pretraining outperformed task-specific training from scratch for several outcome predictions. More notably, compared with five-stage-supervised pretraining, the SSL models showed meaningful advantages for BMI and age prediction. For AHI, sex, and functional outcomes, the differences were smaller, sometimes nominal, and in some cases not reliably distinguishable from the supervised baseline.

A particularly instructive finding came from nested control analyses. When the researchers asked whether the SSL-derived representations added incremental value beyond what could be explained by covariates, conventional stage summaries, spectral summaries, and even a matched five-stage representation, the answer was yes. The self-supervised models captured health-relevant signal that the staged representation missed.
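The logic of a nested control analysis can be sketched as follows. This is an illustrative toy, not the preprint's analysis: the data are synthetic, the column names (age, n3_pct, ssl_feat) are hypothetical, and ordinary least squares is an assumed stand-in for whatever models the authors used. The idea is to fit a baseline model on the control variables, fit a second model that adds the SSL-derived feature, and compare the two on held-out data.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_ols(features, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(row) for row in features]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def r2_score(beta, features, y):
    """R^2 of a fitted linear model on the given (possibly held-out) data."""
    pred = [beta[0] + sum(b * v for b, v in zip(beta[1:], row)) for row in features]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def select(rows, cols):
    """Pick feature columns and the outcome (last column) from each row."""
    return [[r[c] for c in cols] for r in rows], [r[-1] for r in rows]

# Synthetic cohort (assumption): the outcome depends on all three features.
rng = random.Random(0)
rows = []
for _ in range(400):
    age, n3_pct, ssl_feat = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    outcome = age + n3_pct + ssl_feat + rng.gauss(0, 1)
    rows.append((age, n3_pct, ssl_feat, outcome))
train, test = rows[:200], rows[200:]

base_cols, full_cols = [0, 1], [0, 1, 2]   # controls only vs controls + SSL feature
Xb_tr, y_tr = select(train, base_cols)
Xf_tr, _ = select(train, full_cols)
Xb_te, y_te = select(test, base_cols)
Xf_te, _ = select(test, full_cols)

r2_base = r2_score(fit_ols(Xb_tr, y_tr), Xb_te, y_te)
r2_full = r2_score(fit_ols(Xf_tr, y_tr), Xf_te, y_te)
gain = r2_full - r2_base
print(f"held-out R^2: baseline={r2_base:.3f}, with SSL feature={r2_full:.3f}")
```

Scoring on held-out data matters here: adding a predictor can never lower the in-sample fit of a least-squares model, so only an out-of-sample gain is evidence that the extra feature carries real signal.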

Why it matters

If confirmed and refined, this work has implications that extend well beyond a single algorithmic trick. The authors report that the SSL models recover the stage scaffold without ever being shown a labeled stage, which suggests that the traditional five-stage framework does capture real, reproducible structure in sleep EEG. But the models also appear to encode finer-grained, stage-anchored microstructure that carries task-specific health information.

This is conceptually important. It implies that within any given sleep stage, there is meaningful physiological variation that current clinical scoring discards. Two hours of N2 sleep that look identical on a hypnogram may encode very different health signals depending on subtle EEG features that human scorers are not trained to recognize and that standard summary metrics (spectral power bands, stage percentages) fail to capture.
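A small toy example shows how this can happen. The hypnogram, the per-epoch feature values, and the function names below are all invented for illustration; the feature could be any per-epoch EEG measurement. Two nights share the same hypnogram, so every stage percentage matches, yet the feature differs sharply inside N2.

```python
from collections import Counter

def stage_percentages(hypnogram):
    """Percentage of epochs spent in each stage."""
    counts = Counter(hypnogram)
    total = len(hypnogram)
    return {stage: 100.0 * n / total for stage, n in counts.items()}

def mean_within_stage(hypnogram, feature, stage):
    """Average of a per-epoch feature over the epochs scored as `stage`."""
    values = [v for s, v in zip(hypnogram, feature) if s == stage]
    return sum(values) / len(values)

# Hypothetical 10-epoch hypnogram, identical for both nights.
hypnogram = ["W", "N1", "N2", "N2", "N2", "N3", "N3", "N2", "REM", "REM"]

# Hypothetical per-epoch feature values; only the N2 epochs differ.
night_a = [0.10, 0.12, 0.30, 0.32, 0.28, 0.15, 0.14, 0.30, 0.11, 0.10]
night_b = [0.10, 0.12, 0.18, 0.20, 0.16, 0.15, 0.14, 0.18, 0.11, 0.10]

pct = stage_percentages(hypnogram)          # the same summary for both nights
n2_a = mean_within_stage(hypnogram, night_a, "N2")
n2_b = mean_within_stage(hypnogram, night_b, "N2")
print(f"N2 share: {pct['N2']:.0f}% for both nights")
print(f"mean feature within N2: night A={n2_a:.2f}, night B={n2_b:.2f}")
```

A stage-percentage summary would treat these two nights as identical; a representation that keeps within-stage detail would not.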

From a practical standpoint, the ability to predict BMI and age from sleep EEG beyond what staging provides opens the door to using sleep recordings as a broader health screening tool. Sleep is already understood as a window into systemic physiology; this work suggests the window may be much wider than we have known how to look through.

For the sleep field, which has long wrestled with the limits of the Rechtschaffen and Kales / AASM staging framework, this study adds computational weight to the argument that better measurement tools are needed. Foundation models trained across massive, diverse datasets may be one such tool.

Limits of the study

As a preprint, this work has not undergone peer review, and the conclusions should be treated as provisional. The study also has several methodological limitations that warrant attention.

The performance advantages of SSL over stage-supervised pretraining were not uniform across all outcomes. For some clinically important measures, including AHI and functional outcomes, the added value of SSL was modest or inconsistent. This raises questions about whether the approach is broadly useful or mainly effective for certain types of predictions.

The training data, while large at over 11,000 recordings, may still contain biases in cohort composition, recording equipment, and scoring conventions. The models were evaluated on held-out data within the same cohort pools; independent validation in entirely new populations will be essential.

Finally, the study does not address the practical deployment question. Even if SSL-derived representations carry richer information, translating those representations into clinically actionable tools requires additional work: interpretability methods to understand what the models are actually detecting, regulatory validation, and integration into clinical workflows.

The bottom line

Coon and Ogg provide a compelling computational demonstration that self-supervised learning on sleep EEG can recover the standard sleep stage architecture while also preserving finer-grained physiological detail that improves health screening beyond traditional staging. The effects are clearest for BMI and age prediction, with more mixed results for other outcomes.

The work adds to growing evidence that foundation models trained on physiological time series can extract information that human-engineered features and clinical staging systems leave behind. But it also underscores that these models are not magic: their advantages vary by outcome, they require large training datasets, and they need independent replication before they can inform clinical practice.

For now, the message is one of measured excitement. Sleep staging has been a remarkably durable framework, and this study suggests it is not wrong. It may simply be incomplete.

Source: Coon WG, Ogg M. Sleep EEG foundation models reveal within-stage microstructure that improves health screening beyond traditional stages. Res Sq [Preprint]. 2026 Jun 26:rs.3.rs-9044150. DOI: 10.21203/rs.3.rs-9044150/v2. PMID: 42396520. Note: This is a preprint and has not been peer-reviewed.