Clinical implication of machine learning in predicting the occurrence of cardiovascular disease using big data (Nationwide Cohort Data in Korea)

Gihun Joo, Yeongjin Song, Hyeonseung Im, Junbeom Park

Research output: Contribution to journal › Article › peer-review

18 Scopus citations


Machine learning (ML) and large-scale big data are key factors in developing an accurate prediction model for cardiovascular disease (CVD). Although CVD risk often depends on race and ethnicity, most previous studies considered only US or European populations for CVD risk prediction. In this work, to complement previous research, we analyzed the Korean National Health Insurance Service-National Health Sample Cohort (KNHSC) data and studied the characteristics of ML and big data for predicting CVD risk. More specifically, we assessed the effectiveness of various ML methods in predicting the 2-year and 10-year risk of CVDs such as atrial fibrillation, coronary artery disease, heart failure, and stroke. To develop prediction models, we considered the usual medical examination data, questionnaire survey results, comorbidities, and past medication information available in the KNHSC data. We developed ML-based prediction models using logistic regression, deep neural networks, random forests, and LightGBM, and validated them using metrics such as receiver operating characteristic curves, precision-recall curves, sensitivity, specificity, and F1 score. Experimental results showed that all ML models outperformed the baseline method derived from the ACC/AHA guidelines for estimating the 10-year CVD risk, demonstrating the usefulness of ML methods. In addition, in our analysis, the prediction accuracies of the ML models were comparable to one another whether or not past medication information was included as a feature. Because physicians' use of medications provided important information on the occurrence of diseases, all prediction models achieved slightly higher prediction accuracy when it was included as a feature.
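To make the threshold-based validation metrics named in the abstract concrete, the following is a minimal sketch of how sensitivity, specificity, and F1 score can be computed from binary outcomes and predicted risk scores. It is not the authors' code: the function name `binary_metrics`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and no KNHSC data is involved.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_score: Sequence[float],
                   threshold: float = 0.5) -> Dict[str, float]:
    """Sensitivity, specificity, precision and F1 at one decision threshold.

    Hypothetical helper for illustration; labels are 1 (event) or 0 (no event).
    """
    if len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must have the same length")

    # Count the four cells of the confusion matrix.
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 0 and predicted == 1:
            fp += 1
        elif label == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of dividing by zero when a class is absent.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)     # positive predictive value
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Toy example (made-up values, not study data).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.2, 0.6, 0.3, 0.1, 0.4, 0.05]
metrics = binary_metrics(labels, scores)
```

Receiver operating characteristic and precision-recall curves follow from evaluating the same counts over a sweep of thresholds rather than a single one.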

Original language: English
Article number: 9186081
Pages (from-to): 157643-157653
Number of pages: 11
Journal: IEEE Access
State: Published - 2020


  • Atrial fibrillation
  • Korean national health insurance data
  • cardiovascular disease
  • coronary artery disease
  • heart failure
  • machine learning
  • stroke

