TY - JOUR
T1 - Enhancing machine learning models for total organic carbon prediction by integrating geospatial parameters in river watersheds
AU - Oh, Haeseong
AU - Park, Ho Yeon
AU - Kim, Jae In
AU - Lee, Byung Joon
AU - Choi, Jung Hyun
AU - Hur, Jin
N1 - Publisher Copyright: © 2024
PY - 2024/9/15
Y1 - 2024/9/15
N2 - This study utilizes machine learning (ML) algorithms to develop a robust total organic carbon (TOC) prediction model for river waters in the Geumho River sub-basins, South Korea, considering both non-rain and rain events. The model incorporates geospatial parameters such as land use, slope, flow rate, and basic water quality metrics including biochemical oxygen demand (BOD), chemical oxygen demand (COD), total nitrogen (TN), total phosphorus (TP), and suspended solids (SS). A key aspect of this research is examining how land use information enhances the model's predictive accuracy. We compared two ML algorithms, extreme gradient boosting (XGBoost) and deep neural networks (DNN), with a traditional multiple linear regression (MLR) approach. XGBoost outperformed the others, achieving an R² value between 0.61 and 0.68 in the test dataset and demonstrating significant improvement during rain events with an R² of 0.77 when including land use data. In contrast, this enhancement was not observed with the MLR model. Feature importance analysis using Shapley values highlighted COD as the primary predictor for non-rain events, while during rain events, COD, TP, TN, SS and agricultural land collectively influenced TOC levels. This study significantly advances understanding of TOC variability across different land use scenarios in river systems and underscores the importance of integrating geospatial and water quality parameters to enhance TOC prediction, particularly during rain events. This methodology provides a valuable framework for developing river management strategies and monitoring long-term TOC trends, especially in scenarios with gaps in essential monitoring data.
AB - This study utilizes machine learning (ML) algorithms to develop a robust total organic carbon (TOC) prediction model for river waters in the Geumho River sub-basins, South Korea, considering both non-rain and rain events. The model incorporates geospatial parameters such as land use, slope, flow rate, and basic water quality metrics including biochemical oxygen demand (BOD), chemical oxygen demand (COD), total nitrogen (TN), total phosphorus (TP), and suspended solids (SS). A key aspect of this research is examining how land use information enhances the model's predictive accuracy. We compared two ML algorithms, extreme gradient boosting (XGBoost) and deep neural networks (DNN), with a traditional multiple linear regression (MLR) approach. XGBoost outperformed the others, achieving an R² value between 0.61 and 0.68 in the test dataset and demonstrating significant improvement during rain events with an R² of 0.77 when including land use data. In contrast, this enhancement was not observed with the MLR model. Feature importance analysis using Shapley values highlighted COD as the primary predictor for non-rain events, while during rain events, COD, TP, TN, SS and agricultural land collectively influenced TOC levels. This study significantly advances understanding of TOC variability across different land use scenarios in river systems and underscores the importance of integrating geospatial and water quality parameters to enhance TOC prediction, particularly during rain events. This methodology provides a valuable framework for developing river management strategies and monitoring long-term TOC trends, especially in scenarios with gaps in essential monitoring data.
KW - Feature importance
KW - Land use
KW - Machine learning
KW - Prediction model
KW - Total organic carbon
UR - http://www.scopus.com/inward/record.url?scp=85195606778&partnerID=8YFLogxK
U2 - 10.1016/j.scitotenv.2024.173743
DO - 10.1016/j.scitotenv.2024.173743
M3 - Article
C2 - 38848906
AN - SCOPUS:85195606778
SN - 0048-9697
VL - 943
JO - Science of the Total Environment
JF - Science of the Total Environment
M1 - 173743
ER -