The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
The outcome of document clustering depends on the scheme used to assign a weight to each term in a document. While recent works have tried to use class-related distributions to enhance discrimination ability, it remains worth exploring whether a deviation approach or an entropy approach is more effective. This paper presents a comparison between deviation-based and entropy-based distributions as constraints in term weighting. In addition, their potential combinations are investigated to find optimal solutions for guiding the clustering process. In the experiments, the seeded k-means method is used for clustering, and the performance of the deviation-based, entropy-based, and hybrid approaches is analyzed using two English text datasets and one Thai text dataset. The results show that the deviation-based distribution outperforms the entropy-based distribution, and that a suitable combination of these distributions increases clustering accuracy by 10%.
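To make the ideas in the abstract concrete, the following is a minimal sketch of seeded k-means driven by class-aware term weights. It is not the authors' method: the abstract does not give the weighting formulas, so `entropy_weight` (one minus the normalized entropy of a term's distribution over classes), `deviation_weight` (a normalized coefficient of variation of per-class mean frequencies), and the hybrid formed by multiplying the two are all illustrative assumptions, as are all function and parameter names. Only the seeding step, where initial centroids are the means of the labeled seed documents, follows the standard definition of seeded k-means. The sketch also assumes at least two classes and at least one seed document per class.

```python
import numpy as np

def entropy_weight(tf, labels, k):
    """Hypothetical entropy-based weight per term, in [0, 1].
    1 means the term occurs in a single class; 0 means it is spread evenly."""
    # class_tf[c, t] = total frequency of term t over seed documents of class c
    class_tf = np.stack([tf[labels == c].sum(axis=0) for c in range(k)]).astype(float)
    totals = class_tf.sum(axis=0)
    p = class_tf / np.where(totals > 0, totals, 1.0)
    # p * log(p), with the convention 0 * log(0) = 0
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    w = 1.0 + plogp.sum(axis=0) / np.log(k)  # 1 - H / log(k)
    return np.where(totals > 0, w, 0.0)      # terms unseen in the seeds get 0

def deviation_weight(tf, labels, k):
    """Hypothetical deviation-based weight per term, in [0, 1]:
    coefficient of variation of per-class mean frequencies, divided by its
    maximum possible value sqrt(k - 1)."""
    means = np.stack([tf[labels == c].mean(axis=0) for c in range(k)]).astype(float)
    avg = means.mean(axis=0)
    cv = means.std(axis=0) / np.where(avg > 0, avg, 1.0)
    return cv / np.sqrt(k - 1)

def seeded_kmeans(X, seed_idx, seed_labels, k, n_iter=50):
    """k-means whose initial centroids are the means of the labeled seeds."""
    X = np.asarray(X, dtype=float)
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    centroids = np.stack(
        [X[seed_idx[seed_labels == c]].mean(axis=0) for c in range(k)]
    )
    assign = None
    for _ in range(n_iter):
        # squared Euclidean distance from every document to every centroid
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = dist.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments stopped changing
        assign = new_assign
        for c in range(k):
            if np.any(assign == c):  # keep the old centroid if a cluster empties
                centroids[c] = X[assign == c].mean(axis=0)
    return assign
```

In use, the weights would be estimated from the seed documents only, multiplied into the term-frequency matrix (for the hybrid variant, `tf * entropy_weight(...) * deviation_weight(...)`), and the weighted matrix passed to `seeded_kmeans`. The multiplicative combination is just one simple choice for this sketch; the paper investigates its own combinations.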
Uraiwan BUATOOM
Thammasat University
Waree KONGPRAWECHNON
Thammasat University
Thanaruk THEERAMUNKONG
Thammasat University, The Royal Society of Thailand
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Uraiwan BUATOOM, Waree KONGPRAWECHNON, Thanaruk THEERAMUNKONG, "Improving Seeded k-Means Clustering with Deviation- and Entropy-Based Term Weightings" in IEICE TRANSACTIONS on Information, vol. E103-D, no. 4, pp. 748-758, April 2020, doi: 10.1587/transinf.2019IIP0017.
Abstract: The outcome of document clustering depends on the scheme used to assign a weight to each term in a document. While recent works have tried to use class-related distributions to enhance discrimination ability, it remains worth exploring whether a deviation approach or an entropy approach is more effective. This paper presents a comparison between deviation-based and entropy-based distributions as constraints in term weighting. In addition, their potential combinations are investigated to find optimal solutions for guiding the clustering process. In the experiments, the seeded k-means method is used for clustering, and the performance of the deviation-based, entropy-based, and hybrid approaches is analyzed using two English text datasets and one Thai text dataset. The results show that the deviation-based distribution outperforms the entropy-based distribution, and that a suitable combination of these distributions increases clustering accuracy by 10%.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2019IIP0017/_p
@ARTICLE{e103-d_4_748,
author={Uraiwan BUATOOM and Waree KONGPRAWECHNON and Thanaruk THEERAMUNKONG},
journal={IEICE TRANSACTIONS on Information},
title={Improving Seeded k-Means Clustering with Deviation- and Entropy-Based Term Weightings},
year={2020},
volume={E103-D},
number={4},
pages={748-758},
abstract={The outcome of document clustering depends on the scheme used to assign a weight to each term in a document. While recent works have tried to use class-related distributions to enhance discrimination ability, it remains worth exploring whether a deviation approach or an entropy approach is more effective. This paper presents a comparison between deviation-based and entropy-based distributions as constraints in term weighting. In addition, their potential combinations are investigated to find optimal solutions for guiding the clustering process. In the experiments, the seeded k-means method is used for clustering, and the performance of the deviation-based, entropy-based, and hybrid approaches is analyzed using two English text datasets and one Thai text dataset. The results show that the deviation-based distribution outperforms the entropy-based distribution, and that a suitable combination of these distributions increases clustering accuracy by 10%.},
keywords={},
doi={10.1587/transinf.2019IIP0017},
ISSN={1745-1361},
month={April},}
TY - JOUR
TI - Improving Seeded k-Means Clustering with Deviation- and Entropy-Based Term Weightings
T2 - IEICE TRANSACTIONS on Information
SP - 748
EP - 758
AU - Uraiwan BUATOOM
AU - Waree KONGPRAWECHNON
AU - Thanaruk THEERAMUNKONG
PY - 2020
DO - 10.1587/transinf.2019IIP0017
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E103-D
IS - 4
JA - IEICE TRANSACTIONS on Information
Y1 - April 2020
AB - The outcome of document clustering depends on the scheme used to assign a weight to each term in a document. While recent works have tried to use class-related distributions to enhance discrimination ability, it remains worth exploring whether a deviation approach or an entropy approach is more effective. This paper presents a comparison between deviation-based and entropy-based distributions as constraints in term weighting. In addition, their potential combinations are investigated to find optimal solutions for guiding the clustering process. In the experiments, the seeded k-means method is used for clustering, and the performance of the deviation-based, entropy-based, and hybrid approaches is analyzed using two English text datasets and one Thai text dataset. The results show that the deviation-based distribution outperforms the entropy-based distribution, and that a suitable combination of these distributions increases clustering accuracy by 10%.
ER -