UNI-FIND

Exploiting mutual information for the imputation of static and dynamic mixed-type clinical data with an adaptive k-nearest neighbours approach

Article
Publication date:
2020
Abstract:
Background: Clinical registers are an invaluable resource for data-driven medical decision making. Accurate machine learning and data mining approaches applied to these data can lead to faster diagnosis, tailored interventions, and improved outcome prediction. A typical issue when implementing such approaches is the almost unavoidable presence of missing values in the collected data. In this work, we propose an imputation algorithm based on a mutual information-weighted k-nearest neighbours approach, able to handle the simultaneous presence of missing information in different types of variables. We developed and validated the method on a clinical register consisting of the information collected over subsequent screening visits of a cohort of patients affected by amyotrophic lateral sclerosis.
Methods: For each subject with missing data to be imputed, we create a feature vector from the information collected over their first three months of visits. This vector is used as the sample in a k-nearest neighbours procedure to select, among the other patients, those with the most similar temporal evolution of the disease. An ad hoc similarity metric was implemented for the sample comparison; it handles the mixed types of the data and the presence of multiple missing values, and it includes the cross-information among features captured by the mutual information statistic.
Results: We validated the proposed imputation method on an independent test set, where it performed better than three state-of-the-art competitors. We further assessed the validity of our algorithm by comparing the performance of a survival classifier built on the data imputed with our method against one built on the data imputed with the best-performing competitor.
Conclusions: Imputation of missing data is a crucial, and often mandatory, step when working with real-world datasets. The proposed algorithm could effectively impute an amyotrophic lateral sclerosis clinical dataset by handling the temporal and mixed-type nature of the data and by exploiting the cross-information among features. We also showed how imputation quality can affect a machine learning task.
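The general idea described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity metric, so the per-feature distance (range-scaled absolute difference for numeric features, mismatch for categorical ones), the equal-width binning used to compute mutual information, and all function names (mutual_information, discretise, distance, impute) are assumptions made for illustration. The sketch also works on a flat feature vector rather than on three months of visits, and it assumes every column has at least one observed value.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """MI (in nats) of two discrete sequences, using only co-observed pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def discretise(col, bins=3):
    """Equal-width binning of a numeric column (an assumption); None stays None."""
    obs = [v for v in col if v is not None]
    lo, hi = min(obs), max(obs)
    width = (hi - lo) / bins or 1.0
    return [None if v is None else min(int((v - lo) / width), bins - 1) for v in col]

def distance(a, b, kinds, weights, ranges, skip):
    """MI-weighted mean dissimilarity over features observed in both records."""
    num = den = 0.0
    for j, kind in enumerate(kinds):
        if j == skip or a[j] is None or b[j] is None:
            continue
        if kind == "num":
            d = abs(a[j] - b[j]) / ranges[j]   # scaled to roughly [0, 1]
        else:
            d = float(a[j] != b[j])            # 0 if equal, 1 otherwise
        num += weights[j] * d
        den += weights[j]
    return num / den if den > 0 else float("inf")

def impute(data, kinds, k=2):
    """Fill each missing value from the k most similar records that have it."""
    cols = [list(c) for c in zip(*data)]
    disc = [discretise(c) if kind == "num" else c for c, kind in zip(cols, kinds)]
    ranges = []
    for c, kind in zip(cols, kinds):
        obs = [v for v in c if v is not None]
        ranges.append((max(obs) - min(obs)) or 1.0 if kind == "num" else 1.0)
    out = [list(r) for r in data]
    for t, kind in enumerate(kinds):
        # Weight every feature by its mutual information with the target feature t.
        weights = [mutual_information(disc[t], disc[j]) for j in range(len(kinds))]
        donors = [r for r in data if r[t] is not None]
        for i, row in enumerate(data):
            if row[t] is not None or not donors:
                continue
            nearest = sorted(
                donors, key=lambda r: distance(row, r, kinds, weights, ranges, t)
            )[:k]
            vals = [r[t] for r in nearest]
            if kind == "num":
                out[i][t] = sum(vals) / len(vals)                # mean of neighbours
            else:
                out[i][t] = Counter(vals).most_common(1)[0][0]   # mode of neighbours
    return out

# Toy data (made up for illustration): two numeric features and one categorical.
data = [
    [1.0, "a", 10.0],
    [1.2, "a", 11.0],
    [5.0, "b", 50.0],
    [5.2, "b", 52.0],
    [1.1, "a", None],
    [5.1, None, 51.0],
]
kinds = ["num", "cat", "num"]
filled = impute(data, kinds, k=2)
print(filled[4], filled[5])
```

In this toy example the missing numeric value is filled with the mean of its two nearest neighbours and the missing category with their most frequent label. Distances are always computed on the original, un-imputed records, so the order in which features are processed does not change the result.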
CRIS type:
03A-Journal article
Keywords:
Amyotrophic lateral sclerosis; Clinical datasets; Imputation; K-nearest neighbours; Missing data; Mutual information; Naïve Bayes; Amyotrophic Lateral Sclerosis; Bayes Theorem; Computational Biology; Disease; Humans; Information Storage and Retrieval; Algorithms; Data Mining; Datasets as Topic
Author list:
Tavazzi E.; Daberdaku S.; Vasta R.; Calvo A.; Chio A.; Di Camillo B.
University authors:
CALVO Andrea
CHIO' Adriano
VASTA Rosario
Link to the full record:
https://iris.unito.it/handle/2318/1776186
Link to the full text:
https://iris.unito.it/retrieve/handle/2318/1776186/721959/BMC%20Med%20Inform%20Dic%20Making%202020%20-%20Tavazzi%20-%20imputation%20of%20data.pdf
Published in:
BMC MEDICAL INFORMATICS AND DECISION MAKING
Journal