Abstract
Topic modeling has emerged as a successful approach to uncovering topics from textual data. Various topic modeling techniques have been introduced, ranging from traditional algorithms to those based on neural networks. In this research, we explore advanced topic modeling techniques, including BERT-based approaches, to enhance the analysis of scientific articles. We first investigate the widely used Latent Dirichlet Allocation (LDA) model and then explore the capabilities of BERT to automatically uncover latent topics within scientific papers. The goal of this study is to identify the optimal hyperparameter setting for BERT-based topic modeling of scientific articles. We conduct experiments across several scenarios involving combinations of word embedding, dimension reduction, and clustering methods. The results are analyzed based on coherence values, average execution time, the number of topics generated, visualization through the inter-topic distance map, and the top-N words of each topic. Our findings suggest that the combination of RoBERTa for word embedding, PCA for dimension reduction, and K-Means for clustering yields the best results among the tested scenarios. Further evaluation of BERT-based topic modeling is necessary to validate these findings and explore its applications in various academic and industrial contexts. The implications of these advanced techniques could significantly streamline the process of staying up to date with scientific literature, potentially changing research methodologies across disciplines.
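The scenario-based setup described in the abstract can be illustrated with a minimal sketch that enumerates every combination of embedding, dimension reduction, and clustering method, then selects one from measured results. This is not the authors' code. Only RoBERTa, PCA, and K-Means are named in the abstract; the other option names, the result keys (`scenario`, `coherence`, `seconds`), and the selection rule (highest coherence, ties broken by lower execution time) are assumptions made for illustration.

```python
from itertools import product

# Only RoBERTa, PCA, and K-Means are named in the abstract; every other
# option below is a hypothetical placeholder for illustration.
EMBEDDINGS = ["RoBERTa", "embedding_B"]
REDUCERS = ["PCA", "reducer_B"]
CLUSTERERS = ["K-Means", "clusterer_B"]


def build_scenarios(embeddings, reducers, clusterers):
    """Return every (embedding, reducer, clusterer) combination as a dict."""
    return [
        {"embedding": e, "reduction": r, "clustering": c}
        for e, r, c in product(embeddings, reducers, clusterers)
    ]


def best_scenario(results):
    """Pick the result with the highest coherence; ties go to the faster run.

    `results` is a list of dicts with the hypothetical keys
    "scenario", "coherence", and "seconds". This ranking rule is an
    assumption, not the evaluation procedure used in the paper.
    """
    if not results:
        raise ValueError("no results to compare")
    return max(results, key=lambda r: (r["coherence"], -r["seconds"]))


scenarios = build_scenarios(EMBEDDINGS, REDUCERS, CLUSTERERS)
```

In practice, each scenario would be run on the corpus, its coherence and execution time recorded, and the collected records passed to `best_scenario`. The other criteria listed in the abstract, such as the number of topics and the inter-topic distance map, would still need to be inspected separately.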
Original language | English |
---|---|
Pages (from-to) | 912-919 |
Number of pages | 8 |
Journal | International Journal on Advanced Science, Engineering and Information Technology |
Volume | 14 |
Issue number | 3 |
State | Published - 2024 |
Bibliographical note
Publisher Copyright: © IJASEIT is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Keywords
- BERT-based
- hyperparameter
- scientific articles
- topic modeling