A comparison of imputation methods using machine learning models

Heajung Suh, Jongwoo Song

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Handling missing values in data analysis is essential in constructing a good prediction model. The easiest way to handle missing values is to use only the complete cases, but this can lead to information loss and invalid conclusions. Imputation is a technique that replaces missing data with alternative values obtained from information in the dataset. Conventional imputation methods include K-nearest-neighbor imputation and multiple imputation. Recent methods include missForest, missRanger, and mixgb, all of which use machine learning algorithms. This paper compares these imputation techniques on datasets with mixed data types under varying conditions, including data size, missing ratio, and missing mechanism. To evaluate the performance of each method on mixed datasets, we propose a new imputation performance measure (IPM), a unified measurement applicable to both numerical and categorical variables. We believe this metric can help find the best imputation method. Finally, we summarize the comparison results in terms of imputation performance and computational time.
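The contrast the abstract draws between complete-case analysis and imputation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic numeric data, the helper names (`mask_mcar`, `complete_cases`, `knn_impute`, `rmse_on_masked`), the choice of a completely-at-random missing mechanism, and the use of RMSE on the masked cells as a stand-in score. It is not the paper's IPM, whose definition the abstract does not give, and it does not handle categorical variables.

```python
import math
import random


def mask_mcar(rows, ratio, rng):
    """Copy rows, setting each cell to None with probability `ratio` (MCAR)."""
    return [[None if rng.random() < ratio else v for v in row] for row in rows]


def complete_cases(rows):
    """Keep only the rows without any missing cell (complete-case analysis)."""
    return [row for row in rows if None not in row]


def _distance(a, b):
    """Root-mean-square difference over columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no common columns: treat as the farthest donor
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=3):
    """Fill each missing cell with the mean of that column over the k nearest
    rows that have the column observed (numeric data only)."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [(_distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the cell missing if no donor exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


def rmse_on_masked(truth, masked, imputed):
    """RMSE between true and imputed values, over originally masked cells."""
    errors = [(t - p) ** 2
              for tr, mr, pr in zip(truth, masked, imputed)
              for t, m, p in zip(tr, mr, pr)
              if m is None and p is not None]
    return math.sqrt(sum(errors) / len(errors)) if errors else float("nan")


# Hypothetical data: three correlated numeric columns, 20% of cells masked.
rng = random.Random(0)
truth = []
for _ in range(60):
    x = rng.uniform(0, 10)
    truth.append([x, 2 * x + rng.gauss(0, 0.5), -x + rng.gauss(0, 0.5)])

masked = mask_mcar(truth, ratio=0.2, rng=rng)
imputed = knn_impute(masked, k=5)

print("complete cases kept:", len(complete_cases(masked)), "of", len(masked))
print("RMSE on masked cells:", rmse_on_masked(truth, masked, imputed))
```

Because the true values are known before masking, the same harness could in principle be rerun with a different missing ratio, dataset size, or imputation function to compare methods, which mirrors the kind of comparison the abstract describes.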

Original language: English
Pages (from-to): 331-341
Number of pages: 11
Journal: Communications for Statistical Applications and Methods
Volume: 30
Issue number: 3
State: Published - 2023

Bibliographical note

Publisher Copyright:
© 2023 The Korean Statistical Society, and Korean International Statistical Society. All rights reserved.

Keywords

  • KNN imputation
  • missForest
  • missRanger
  • mixgb
  • multiple imputation
