TY - JOUR
T1 - Text mining in MOF research
T2 - from manual curation to large language model-based automation
AU - Bae, Suyeon
AU - Jeon, Mingyu
AU - Moon, Hoi Ri
N1 - Publisher Copyright: © 2025 The Royal Society of Chemistry.
PY - 2025/7/22
Y1 - 2025/7/22
AB - The rapid expansion of metal-organic framework (MOF) literature presents both a rich resource and a significant challenge for knowledge extraction. Text mining, which enables the conversion of unstructured scientific texts into structured, machine-readable data, has emerged as a key tool for accelerating data-driven research in the MOF domain. This review traces the development of text mining approaches in MOF research, from early manual curation and rule-based methods to recent breakthroughs powered by large language model (LLM)-based automation. We discuss the foundational role of natural language processing (NLP) and machine learning (ML) techniques such as named entity recognition and vector embedding models, followed by an in-depth analysis of LLM-based frameworks that enable flexible, scalable, and context-aware information extraction. Additionally, we introduce and compare their accuracy, and explore their diverse applications—including prediction of synthesizability, materials properties, and thermal stability. We conclude with a perspective on future directions for text mining in MOF research, including its integration into interactive graphical user interfaces, autonomous laboratories, multi-agent AI systems, and multi-modal LLM frameworks that can process textual, visual, and structural information in a unified way. This review aims to provide a foundational understanding for both experimental and computational researchers interested in adopting or advancing text mining methods in the MOF field.
UR - https://www.scopus.com/pages/publications/105009932119
U2 - 10.1039/d5cc02511g
DO - 10.1039/d5cc02511g
M3 - Review article
C2 - 40613389
AN - SCOPUS:105009932119
SN - 1359-7345
VL - 61
SP - 11083
EP - 11094
JO - Chemical Communications
JF - Chemical Communications
IS - 60
ER -