The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and mix several kinds of text with natural language, and these characteristics degrade the performance of topic modeling. Among the natural language pre-processing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches have been used to generate effective stop words, the resulting lists are outdated or too general for mining software artifacts. In addition, the performance of the topic model is sensitive to the datasets used for training in each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. By measuring topic coherence among the words in a topic using Pointwise Mutual Information (PMI), we add words with a low PMI score to our stop word list in every topic modeling loop. Through our experiment, we proved that our stop word list yields higher topic model performance than lists from other approaches.
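The PMI-based selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact scoring formula or cut-off, so the document-level co-occurrence probabilities, the use of a word's mean PMI against the other words of its topic, the `threshold` parameter, and all function names here are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def doc_frequencies(docs):
    """Count, per word and per unordered word pair, how many documents contain them."""
    word_df, pair_df = Counter(), Counter()
    for doc in docs:
        vocab = set(doc)
        word_df.update(vocab)
        pair_df.update(frozenset(p) for p in combinations(sorted(vocab), 2))
    return word_df, pair_df


def pmi(w1, w2, word_df, pair_df, n_docs, eps=1e-12):
    """PMI(w1, w2) = log(p(w1, w2) / (p(w1) * p(w2))) with document-level probabilities.

    eps is a small smoothing constant (an assumption) that avoids log(0).
    """
    p1 = word_df[w1] / n_docs
    p2 = word_df[w2] / n_docs
    p12 = pair_df[frozenset((w1, w2))] / n_docs
    return math.log((p12 + eps) / (p1 * p2 + eps))


def low_pmi_words(topic_words, docs, threshold=0.1):
    """Return the topic words whose mean PMI with the other topic words is below threshold."""
    word_df, pair_df = doc_frequencies(docs)
    n_docs = len(docs)
    flagged = []
    for word in topic_words:
        others = [o for o in topic_words if o != word]
        if not others:
            continue  # a one-word topic has no pairs to score
        score = sum(pmi(word, o, word_df, pair_df, n_docs) for o in others) / len(others)
        if score < threshold:
            flagged.append(word)
    return flagged


# Toy corpus: "fix" occurs in every document, so it carries no topical signal.
docs = [
    ["parser", "token", "fix"],
    ["parser", "token", "fix"],
    ["parser", "token", "fix"],
    ["socket", "packet", "fix"],
    ["socket", "packet", "fix"],
    ["socket", "packet", "fix"],
]
candidates = low_pmi_words(["parser", "token", "fix"], docs)
```

In an iterative setting, the flagged words would be added to the stop word list, removed from the corpus, and the topic model retrained before scoring again; that loop is omitted here because it depends on the topic modeling library in use.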
Jung-Been LEE
Korea University
Taek LEE
Sungshin University
Hoh Peter IN
Korea University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Jung-Been LEE, Taek LEE, Hoh Peter IN, "Automatic Stop Word Generation for Mining Software Artifact Using Topic Model with Pointwise Mutual Information" in IEICE TRANSACTIONS on Information,
vol. E102-D, no. 9, pp. 1761-1772, September 2019, doi: 10.1587/transinf.2018EDP7390.
Abstract: Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and contain a mix of textual types within the natural text. These software artifact characteristics worsen the performance of topic modeling. Among several natural language pre-processing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches are used to generate effective stop words, the lists are outdated or too general to apply to mining software artifacts. In addition, the performance of the topic model is sensitive to the datasets used in the training for each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. By measuring topic coherence among words in the topic using Pointwise Mutual Information (PMI), we added words with a low PMI score to our stop words list for every topic modeling loop. Through our experiment, we proved that our stop words list results in a higher performance of the topic model than lists from other approaches.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2018EDP7390/_p
@ARTICLE{e102-d_9_1761,
author={Jung-Been LEE and Taek LEE and Hoh Peter IN},
journal={IEICE TRANSACTIONS on Information},
title={Automatic Stop Word Generation for Mining Software Artifact Using Topic Model with Pointwise Mutual Information},
year={2019},
volume={E102-D},
number={9},
pages={1761-1772},
abstract={Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and contain a mix of textual types within the natural text. These software artifact characteristics worsen the performance of topic modeling. Among several natural language pre-processing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches are used to generate effective stop words, the lists are outdated or too general to apply to mining software artifacts. In addition, the performance of the topic model is sensitive to the datasets used in the training for each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. By measuring topic coherence among words in the topic using Pointwise Mutual Information (PMI), we added words with a low PMI score to our stop words list for every topic modeling loop. Through our experiment, we proved that our stop words list results in a higher performance of the topic model than lists from other approaches.},
keywords={},
doi={10.1587/transinf.2018EDP7390},
ISSN={1745-1361},
month={September},}
TY - JOUR
TI - Automatic Stop Word Generation for Mining Software Artifact Using Topic Model with Pointwise Mutual Information
T2 - IEICE TRANSACTIONS on Information
SP - 1761
EP - 1772
AU - Jung-Been LEE
AU - Taek LEE
AU - Hoh Peter IN
PY - 2019
DO - 10.1587/transinf.2018EDP7390
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E102-D
IS - 9
JA - IEICE TRANSACTIONS on Information
Y1 - September 2019
AB - Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and contain a mix of textual types within the natural text. These software artifact characteristics worsen the performance of topic modeling. Among several natural language pre-processing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches are used to generate effective stop words, the lists are outdated or too general to apply to mining software artifacts. In addition, the performance of the topic model is sensitive to the datasets used in the training for each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. By measuring topic coherence among words in the topic using Pointwise Mutual Information (PMI), we added words with a low PMI score to our stop words list for every topic modeling loop. Through our experiment, we proved that our stop words list results in a higher performance of the topic model than lists from other approaches.
ER -