Abstract
Handling missing values properly is essential for constructing a good prediction model. The easiest way to handle missing values is to use only complete cases, but this can lose information and lead to invalid conclusions. Imputation is a technique that replaces missing data with alternative values obtained from information in the dataset. Conventional imputation methods include K-nearest-neighbor (KNN) imputation and multiple imputation. Recent methods include missForest, missRanger, and mixgb, all of which use machine learning algorithms. This paper compares these imputation techniques on datasets with mixed data types under various conditions, including data size, missing ratio, and missingness mechanism. To evaluate the performance of each method on mixed datasets, we propose a new imputation performance measure (IPM), a unified measurement applicable to both numerical and categorical variables. We believe this metric can help identify the best imputation method. Finally, we summarize the comparison results in terms of imputation performance and computational time.
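To make the contrast between complete-case analysis and imputation concrete, the following is a minimal sketch of KNN imputation on a small numeric table. It is not the code used in the paper: the toy data, the function names (`complete_cases`, `distance`, `knn_impute`), and the rescaled Euclidean distance over jointly observed columns are all assumptions made for illustration, and the sketch handles numeric columns only, so mixed data would need a distance that also covers categorical variables.

```python
import math

# Hypothetical toy data; None marks a missing value.
rows = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.2],
    [5.0, 6.0, None],
    [5.2, 6.1, 7.0],
    [0.9, 2.1, 2.9],
]


def complete_cases(data):
    """Keep only rows with no missing entries (complete-case analysis)."""
    return [r for r in data if all(v is not None for v in r)]


def distance(a, b):
    """Euclidean distance over columns observed in both rows.

    The squared sum is rescaled by (total columns / shared columns) so that
    rows sharing few observed columns are not favoured (an assumed choice).
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(data, k=2):
    """Fill each missing entry with the mean of that column over the
    k nearest rows in which the column is observed."""
    imputed = [list(r) for r in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(data)
                if m != i and other[j] is not None
            )[:k]
            if donors:  # leave the entry missing if no row observes column j
                imputed[i][j] = sum(v for _, v in donors) / len(donors)
    return imputed


kept = complete_cases(rows)
filled = knn_impute(rows, k=2)
print(len(kept), "of", len(rows), "rows survive complete-case deletion")
print(filled)
```

Complete-case deletion discards every row with a gap, whereas the imputed table keeps all rows and fills each gap from its nearest neighbors. The choice of `k` and of the distance strongly affects the filled values, which is one reason a common performance measure is needed when comparing methods.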
Original language | English |
---|---|
Pages (from-to) | 331-341 |
Number of pages | 11 |
Journal | Communications for Statistical Applications and Methods |
Volume | 30 |
Issue number | 3 |
DOIs | |
State | Published - 2023 |
Bibliographical note
Publisher Copyright: © 2023 The Korean Statistical Society, and Korean International Statistical Society. All rights reserved.
Keywords
- KNN imputation
- missForest
- missRanger
- mixgb
- multiple imputation