What actually drives wearable sleep staging performance? It is not the model

Performance gains attributed to sophisticated machine learning models for wearable sleep staging may actually reflect dataset characteristics and evaluation choices, according to a study published in Sleep Advances. A simple baseline that estimates wake probability from raw activity data alone can explain much of the variance in performance across architectures.

What they found

Researchers tested three model types — logistic regression, random forest, and an LSTM deep neural network — on actigraphy data from three datasets: a new clinical collection (SleepAccel-Clinical, 28 individuals with sleep apnea), previously collected healthy controls (SleepAccel, 31 participants), and the DREAMT dataset (100 individuals with suspected sleep disorders).

Alongside these, they introduced an “Easy To Classify” (ETC) wake model: a simple baseline that smooths and scales activity in a short time window to estimate wake probability.
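To make the idea concrete, here is a minimal sketch of a smooth-and-scale baseline in the spirit of ETC. It is not the authors' implementation: the function name, the centered moving-average window, the exponential squashing function, and the default parameter values are all assumptions chosen for illustration.

```python
import math


def etc_wake_probability(activity, window=5, scale=10.0):
    """Hypothetical ETC-style baseline (illustrative, not the paper's code).

    Smooths non-negative activity counts with a centered moving average,
    then squashes the result into [0, 1] so it reads as a wake probability.
    `window` (epochs) and `scale` (activity units) are assumed parameters.
    """
    if window < 1 or scale <= 0:
        raise ValueError("window must be >= 1 and scale must be > 0")
    half = window // 2
    probs = []
    for i in range(len(activity)):
        # Clip the window at the edges of the recording.
        lo = max(0, i - half)
        hi = min(len(activity), i + half + 1)
        mean_activity = sum(activity[lo:hi]) / (hi - lo)
        # Zero movement maps to 0; sustained movement approaches 1.
        probs.append(1.0 - math.exp(-mean_activity / scale))
    return probs


# Toy per-epoch activity counts: a quiet stretch, a burst of movement, quiet again.
activity = [0, 0, 0, 0, 50, 60, 55, 0, 0, 0]
for count, p in zip(activity, etc_wake_probability(activity, window=3)):
    print(count, round(p, 3))
```

The point of the sketch is that nothing is learned: the output depends only on how much movement surrounds each epoch, which is why such a baseline can serve as a yardstick for how separable a dataset is.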

Key results:

The ETC baseline correlated significantly (p < 0.01) with model performance across all architectures, datasets, and training conditions. In other words, the simple activity-based model could predict how well a more complex model would perform.
Training data composition mattered more than model choice. Models trained exclusively on healthy individuals performed poorly on datasets with sleep disorders. Including individuals with obstructive sleep apnea in the training set significantly boosted performance on clinical datasets.
Easy-to-classify wake epochs inflate AUROC. Prepending even a few minutes of unambiguous wake (e.g., pre-sleep periods) to a recording artificially raised the area under the ROC curve without improving actual nighttime wake detection. The fraction of ETC wake in a testing night correlated strongly with AUROC.
Dataset separability must be reported. The authors argue that researchers should quantify intrinsic class separability — using a simple baseline like ETC — whenever introducing a new sleep staging dataset.
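The AUROC inflation described above can be reproduced with a toy example. The sketch below uses made-up labels and scores (not data from the study) and a plain pairwise AUROC; the variable names and the number of prepended epochs are assumptions for illustration only.

```python
def auroc(labels, scores):
    """AUROC as the chance that a random wake epoch (label 1) scores
    higher than a random sleep epoch (label 0); ties count as half."""
    wake = [s for y, s in zip(labels, scores) if y == 1]
    sleep = [s for y, s in zip(labels, scores) if y == 0]
    if not wake or not sleep:
        raise ValueError("need at least one wake and one sleep epoch")
    wins = 0.0
    for w in wake:
        for s in sleep:
            if w > s:
                wins += 1.0
            elif w == s:
                wins += 0.5
    return wins / (len(wake) * len(sleep))


# Invented nighttime epochs: wake (1) is only weakly separated from sleep (0).
night_labels = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
night_scores = [0.1, 0.3, 0.2, 0.4, 0.1, 0.5, 0.2, 0.6, 0.3, 0.2]

# Five invented pre-sleep epochs of unambiguous wake, all scored confidently.
easy_labels = [1] * 5
easy_scores = [0.95] * 5

night_only = auroc(night_labels, night_scores)
with_easy_wake = auroc(easy_labels + night_labels, easy_scores + night_scores)

print("night only:", round(night_only, 3))
print("with easy wake prepended:", round(with_easy_wake, 3))
```

The classifier's scores on the nighttime epochs are identical in both runs, so its ability to detect wake during the night has not changed; only the mix of epochs in the test recording has.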

Why it matters

Wearable sleep trackers are widely used by consumers and increasingly adopted in clinical research. But claims of superior performance from new algorithms are hard to evaluate without standard benchmarks. This study shows that much of what passes for algorithmic improvement may be an artifact of how evaluation is conducted.

For clinicians and researchers evaluating wearables, the practical implication is clear: look past the model name. Ask what population the algorithm was trained on, whether the test dataset includes disordered sleep, and whether the reported AUROC reflects real nighttime performance or inflation from easy-to-classify wake epochs.

Limits

The study tested only accelerometer-based actigraphy, not the multi-sensor approaches (heart rate, photoplethysmography) used by many commercial wearables. The clinical dataset was limited to sleep apnea; other sleep disorders may behave differently. ETC is a post-hoc descriptive tool, not a complete solution for fair evaluation.

Bottom line

For wearable sleep staging, dataset composition and evaluation choices explain more performance variance than model architecture. Researchers should benchmark against simple baselines and report dataset separability to avoid inflating claims.

Source

Eric Canton et al. “What matters beyond model choice for wearable sleep staging? How personalization, evaluation choices, and easy-to-classify wake impact performance.” Sleep Advances, May 26, 2026; 7(2): zpag051. DOI: 10.1093/sleepadvances/zpag051