TY - JOUR
T1 - Integration of the Natural Language Processing of Structural Information Simplified Molecular-Input Line-Entry System Can Improve the In Vitro Prediction of Human Skin Sensitizers
AU - Kwon, Jae Hee
AU - Kim, Jihye
AU - Lim, Kyung Min
AU - Kim, Myeong Gyu
N1 - Publisher Copyright: © 2024 by the authors.
PY - 2024/2
Y1 - 2024/2
AB - Natural language processing (NLP) technology has recently been used to predict substance properties based on their Simplified Molecular-Input Line-Entry System (SMILES) representations. We aimed to develop a model predicting human skin sensitizers by integrating text features derived from SMILES with in vitro test outcomes. The dataset on SMILES, physicochemical properties, in vitro tests (DPRA, KeratinoSens™, h-CLAT, and SENS-IS assays), and human potency categories for 122 substances was sourced from the Cosmetics Europe database. The ChemBERTa model was employed to analyze the SMILES of substances. The last hidden layer embedding of ChemBERTa was tested with other features. Given the modest dataset size, we trained five XGBoost models using subsets of the training data and subsequently employed bagging to create the final model. Notably, the features computed from SMILES played a pivotal role in the model for distinguishing sensitizers from non-sensitizers. The final model demonstrated a classification accuracy of 80% and an AUC-ROC of 0.82, effectively discriminating sensitizers from non-sensitizers. Furthermore, the model exhibited an accuracy of 82% and an AUC-ROC of 0.82 in classifying strong and weak sensitizers. In summary, we demonstrated that the integration of NLP of SMILES with in vitro test results can enhance the prediction of health hazards associated with chemicals.
KW - direct peptide reactivity assay (DPRA)
KW - natural language processing
KW - QSAR
KW - SENS-IS
KW - skin sensitizer
UR - http://www.scopus.com/inward/record.url?scp=85185974187&partnerID=8YFLogxK
U2 - 10.3390/toxics12020153
DO - 10.3390/toxics12020153
M3 - Article
AN - SCOPUS:85185974187
SN - 2305-6304
VL - 12
JO - Toxics
JF - Toxics
IS - 2
M1 - 153
ER -