Does not impute! Performance and ethical implications of missing data for an ML diabetes co-morbidity predictor

Research output: Contribution to conference › Paper › peer-review

Abstract

Synthetic data is commonly used when training and testing Machine Learning (ML) components of AI systems, to boost the size of training sets, compensate for sparse examples or introduce edge cases that are impossible to gather otherwise. Sparse data can be a particular issue for healthcare, for example where certain groups or presentations of conditions are poorly represented. However, missing information may in itself be an indicator of the patient's health. In this paper we examine the impact of refraining from using data imputation on the performance of a Type 2 Diabetes (T2D) co-morbidity predictor. The training data for our predictor has many missing values in patient records, particularly in the records of patients from more deprived backgrounds.

Our concern is, first, that by using data imputation to compensate for the missing values we are biasing the performance and disadvantaging patients with higher deprivation. Second, missing data may indicate that the patient was too unwell to attend a clinic and is therefore itself a health predictor. Common practice in the ML community is to allow 10% of missing data to be compensated for using data synthesis, but can this be justified, and what happens if we instead use incomplete data to be more representative? We performed a series of training runs with increasing amounts of targeted empty data values to assess the impact on predictor performance. We found that although there was some performance drop with 10% of training values missing, the drop grew much larger at greater percentages. We discuss the implications of our experiments as part of a safety and ethical justification for the predictor deployment and the choice of model, noting the complex trade-offs this may require.
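The masking step behind such training runs can be illustrated with a minimal sketch. It is not the authors' code: the helper names `inject_missing` and `missing_rate`, the feature names `hba1c` and `bmi`, the toy records and the chosen fractions are all assumptions made for illustration. It blanks a chosen fraction of the observed values in targeted columns and leaves them empty rather than imputing them.

```python
import random


def inject_missing(rows, columns, fraction, seed=0):
    """Return a copy of `rows` with `fraction` of the observed cells
    in `columns` replaced by None (nothing is imputed afterwards)."""
    rng = random.Random(seed)
    # Candidate cells: targeted columns only, currently observed values only.
    cells = [(i, c) for i, row in enumerate(rows)
             for c in columns if row[c] is not None]
    blanked = set(rng.sample(cells, round(fraction * len(cells))))
    return [{k: (None if (i, k) in blanked else v) for k, v in row.items()}
            for i, row in enumerate(rows)]


def missing_rate(rows, columns):
    """Share of cells in `columns` that are None."""
    missing = sum(row[c] is None for row in rows for c in columns)
    return missing / (len(rows) * len(columns))


# Toy records with hypothetical feature names; not the paper's data.
records = [{"hba1c": 48 + i % 20, "bmi": 22 + i % 15, "label": i % 2}
           for i in range(100)]

for fraction in (0.0, 0.1, 0.2, 0.4):
    masked = inject_missing(records, ["hba1c", "bmi"], fraction)
    # A real run would train and score the predictor on `masked` here.
    print(fraction, missing_rate(masked, ["hba1c", "bmi"]))
```

In an actual experiment each masked copy would be passed to a model that accepts missing values directly, and the resulting scores compared across fractions. The abstract does not name the model or the metric, so both are left out of the sketch.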
Original language: English
Publication status: Published - 2025
Event: 8th International Workshop of Artificial Intelligence and Safety Engineering - Stockholm
Duration: 9 Sept 2025 → …
https://www.waise.org/

Workshop

Workshop: 8th International Workshop of Artificial Intelligence and Safety Engineering
Abbreviated title: WAISE
Period: 9/09/25 → …