TY - JOUR
T1 - Clinical implication of machine learning in predicting the occurrence of cardiovascular disease using big data (Nationwide Cohort Data in Korea)
AU - Joo, Gihun
AU - Song, Yeongjin
AU - Im, Hyeonseung
AU - Park, Junbeom
N1 - Funding Information:
This work was supported in part by the National Research Foundation of Korea (NRF) funded by the Korea Government (MSIT) under Grant NRF-2019R1F1A1063272, and in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning under Grant NRF-2017R1E1A1A01078382.
Publisher Copyright:
© 2013 IEEE.
PY - 2020
Y1 - 2020
N2 - Machine learning (ML) and large-scale big data are key factors in developing an accurate prediction model for cardiovascular disease (CVD). Although the CVD risk often depends on the race and ethnicity, most previous studies considered only US or European populations for the CVD risk prediction. In this work, to complement previous researches, we analyzed the Korean National Health Insurance Service-National Health Sample Cohort (KNHSC) data and studied the characteristics of ML and big data for predicting the CVD risk. More specifically, we assessed the effectiveness of various ML methods in predicting the 2-year and 10-year risk of CVD such as atrial fibrillation, coronary artery disease, heart failure, and strokes. To develop prediction models, we considered the usual medical examination data, questionnaire survey results, comorbidities, and past medication information available in the KNHSC data. We developed various ML-based prediction models using logistic regression, deep neural networks, random forests, and LightGBM, and validated them using various metrics such as receiver operating characteristic curves, precision-recall curves, sensitivity, specificity, and F1 score. Experimental results showed that all ML models outperformed the baseline method derived from the ACC/AHA guidelines for estimating the 10-year CVD risk, demonstrating the usefulness of ML methods. In addition, in our analysis, whether we included the past medication information as a feature or not, the prediction accuracy of all ML models was comparable to each other. Since the use of medications by the physicians provided important information on the occurrence of diseases, when we included it as a feature, all prediction models achieved a slightly higher prediction accuracy.
AB - Machine learning (ML) and large-scale big data are key factors in developing an accurate prediction model for cardiovascular disease (CVD). Although the CVD risk often depends on the race and ethnicity, most previous studies considered only US or European populations for the CVD risk prediction. In this work, to complement previous researches, we analyzed the Korean National Health Insurance Service-National Health Sample Cohort (KNHSC) data and studied the characteristics of ML and big data for predicting the CVD risk. More specifically, we assessed the effectiveness of various ML methods in predicting the 2-year and 10-year risk of CVD such as atrial fibrillation, coronary artery disease, heart failure, and strokes. To develop prediction models, we considered the usual medical examination data, questionnaire survey results, comorbidities, and past medication information available in the KNHSC data. We developed various ML-based prediction models using logistic regression, deep neural networks, random forests, and LightGBM, and validated them using various metrics such as receiver operating characteristic curves, precision-recall curves, sensitivity, specificity, and F1 score. Experimental results showed that all ML models outperformed the baseline method derived from the ACC/AHA guidelines for estimating the 10-year CVD risk, demonstrating the usefulness of ML methods. In addition, in our analysis, whether we included the past medication information as a feature or not, the prediction accuracy of all ML models was comparable to each other. Since the use of medications by the physicians provided important information on the occurrence of diseases, when we included it as a feature, all prediction models achieved a slightly higher prediction accuracy.
KW - Atrial fibrillation
KW - Korean national health insurance data
KW - cardiovascular disease
KW - coronary artery disease
KW - heart failure
KW - machine learning
KW - stroke
UR - http://www.scopus.com/inward/record.url?scp=85091235228&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2020.3015757
DO - 10.1109/ACCESS.2020.3015757
M3 - Article
AN - SCOPUS:85091235228
SN - 2169-3536
VL - 8
SP - 157643
EP - 157653
JO - IEEE Access
JF - IEEE Access
M1 - 9186081
ER -