source: arxiv machine learning: informative missingness to generate irregular clinical time series

level: research

laboratory tests in electronic health records are collected at irregular times, and the absence of a test order can carry important information about a patient's condition. this missingness reflects both clinician decisions and underlying physiology, so it should be modeled directly instead of being treated as a data cleaning problem. the work uses the dacmi benchmark, which is based on mimic-iii, to develop a generative approach that handles both the lab measurements and whether they were observed.

the method builds on the timediff framework and extends it to handle continuous lab values alongside discrete observation indicators. to keep the sampling realistic, chart times are aligned into 4-hour intervals and hospital admissions are split into 7-day windows, so each window spans 42 intervals. each resulting trajectory pairs a lab value with a binary flag showing whether it was observed in that interval. standard normalization and transformations are applied to the values to make training stable.
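
the alignment step described above can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the paper's preprocessing code: the function name `to_windows`, the input format (pairs of hours since admission and lab value), the `fill_value` placeholder for unobserved intervals, and the rule that a later measurement overwrites an earlier one in the same interval are all hypothetical choices made for the example.

```python
BIN_HOURS = 4        # interval width stated in the text
WINDOW_DAYS = 7      # window length stated in the text
STEPS_PER_WINDOW = WINDOW_DAYS * 24 // BIN_HOURS  # 42 intervals per window


def to_windows(events, fill_value=0.0):
    """Align one lab's measurements onto a regular grid of 4-hour intervals.

    events: iterable of (hours_since_admission, lab_value) pairs, with
            non-negative times (assumed input format).
    Returns a list of (values, mask) pairs, one per 7-day window, where
    mask[i] is 1 if the lab was observed in interval i and 0 otherwise.
    """
    events = sorted(events)
    if not events:
        return []

    # Number of 7-day windows needed to cover the last measurement.
    last_bin = int(events[-1][0] // BIN_HOURS)
    n_windows = last_bin // STEPS_PER_WINDOW + 1
    windows = [
        ([fill_value] * STEPS_PER_WINDOW, [0] * STEPS_PER_WINDOW)
        for _ in range(n_windows)
    ]

    for hours, value in events:
        bin_index = int(hours // BIN_HOURS)
        window_index, step = divmod(bin_index, STEPS_PER_WINDOW)
        values, mask = windows[window_index]
        # Assumption: if several measurements share an interval, keep the latest.
        values[step] = value
        mask[step] = 1

    return windows


# Example: three measurements in the first week, one early in the second week.
example = to_windows([(1.0, 4.2), (3.5, 4.4), (9.0, 3.9), (170.0, 5.1)])
```

in this example the first window has two observed intervals (the measurements at 1.0 and 3.5 hours share interval 0) and the second window has one. the mask, not the fill value, is what tells a model that an interval carries no measurement.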

by learning the joint distribution of values and missingness patterns, the model can generate synthetic clinical time series that preserve the irregular observation structure of real data. this is useful for downstream tasks such as imputation, data augmentation, and simulation studies where realistic missingness is needed. the approach treats missingness as informative rather than as random noise, which is closer to how clinical data is actually generated.
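
to make the link between a generated sample and an irregular series concrete, the sketch below converts one generated window (values plus observation indicators) back into a list of timestamped measurements. it assumes the generator outputs a per-interval observation probability that is thresholded at 0.5. the function name `to_irregular_series`, the threshold, and the output format are hypothetical and are not taken from the paper.

```python
BIN_HOURS = 4  # interval width stated in the text


def to_irregular_series(values, mask_probs, threshold=0.5):
    """Keep only the intervals that the generated indicator marks as observed.

    values:     generated lab values, one per 4-hour interval.
    mask_probs: generated observation probabilities, one per interval
                (assumed output format of the generator).
    Returns a list of (hours_since_window_start, value) pairs.
    """
    if len(values) != len(mask_probs):
        raise ValueError("values and mask_probs must have the same length")
    return [
        (step * BIN_HOURS, value)
        for step, (value, prob) in enumerate(zip(values, mask_probs))
        if prob >= threshold
    ]


# Example: four intervals, two of which are marked as observed.
series = to_irregular_series([4.4, 0.1, 3.9, 0.0], [0.9, 0.2, 0.7, 0.4])
```

values in intervals marked as unobserved are discarded, so the synthetic series has gaps where the model predicts that no test would have been ordered.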

why it matters: modeling informative missingness helps create more realistic synthetic health data for ai training and evaluation, improving the reliability of clinical machine learning models.

