The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations; e.g., some numerals are rendered as "XNUMX".
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Heum PARK, Hyuk-Chul KWON, "Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification" in IEICE TRANSACTIONS on Information,
vol. E94-D, no. 4, pp. 855-865, April 2011, doi: 10.1587/transinf.E94.D.855.
Abstract: This paper presents an improved Gini-Index algorithm to correct feature-selection bias in text classification. The Gini-Index has been used as a split measure for choosing the most appropriate splitting attribute in decision trees. Recently, an improved Gini-Index algorithm for feature selection, designed for text categorization and based on Gini-Index theory, was introduced, and it has proved better than other methods. However, we found that the Gini-Index still shows a feature-selection bias in text classification, specifically for unbalanced datasets having a huge number of features. The feature-selection bias of the Gini-Index is shown in three ways: 1) the Gini values of low-frequency features are low overall (by the purity measure), irrespective of the distribution of features among classes, 2) for high-frequency features, the Gini values are always relatively high, and 3) for specific features belonging to large classes, the Gini values are relatively lower than those belonging to small classes. Therefore, to correct that bias and improve feature selection in text classification using the Gini-Index, we propose an improved Gini-Index (I-GI) algorithm with three reformulated Gini-Index expressions. In the present study, we used global dimensionality reduction (DR) and local DR to measure the goodness of features in feature selection. In experimental results for the I-GI algorithm, we obtained unbiased feature values and eliminated many irrelevant general features while retaining many specific features. Furthermore, we could improve the overall classification performance when we used the local DR method.
The total averages of the classification performance were increased by 19.4 %, 15.9 %, 3.3 %, 2.8 % and 2.9 % (kNN) in Micro-F1, 14 %, 9.8 %, 9.2 %, 3.5 % and 4.3 % (SVM) in Micro-F1, 20 %, 16.9 %, 2.8 %, 3.6 % and 3.1 % (kNN) in Macro-F1, 16.3 %, 14 %, 7.1 %, 4.4 %, 6.3 % (SVM) in Macro-F1, compared with tf*idf, χ2, Information Gain, Odds Ratio and the existing Gini-Index methods according to each classifier.
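The abstract does not reproduce the paper's three reformulated I-GI expressions, but the baseline measure it corrects is the Gini-Index feature-goodness score commonly used in text categorization, Gini(w) = Σi P(w|ci)² · P(ci|w)². A minimal sketch under that assumption (the function name and the toy counts are illustrative, not from the paper) also reproduces the frequency bias described above:

```python
# Hypothetical illustration of the conventional Gini-Index feature-goodness
# measure for text categorization: Gini(w) = sum_i P(w|c_i)^2 * P(c_i|w)^2.
# The paper's reformulated I-GI expressions are not given in the abstract,
# so this sketches only the baseline whose bias the paper corrects.

def gini_index(term_class_counts, class_totals):
    """Score a term w from its per-class document frequencies.

    term_class_counts: {class: number of class documents containing w}
    class_totals:      {class: total number of documents in class}
    """
    total_w = sum(term_class_counts.values())
    if total_w == 0:
        return 0.0
    score = 0.0
    for c, n_wc in term_class_counts.items():
        p_w_given_c = n_wc / class_totals[c]  # P(w|c): how common w is in c
        p_c_given_w = n_wc / total_w          # P(c|w): how specific w is to c
        score += (p_w_given_c ** 2) * (p_c_given_w ** 2)
    return score

class_totals = {"A": 100, "B": 100}

# A rare term occurring only in class A (perfectly class-specific) ...
rare_specific = gini_index({"A": 1, "B": 0}, class_totals)
# ... versus a frequent term spread evenly over both classes (uninformative).
frequent_general = gini_index({"A": 50, "B": 50}, class_totals)

# The low-/high-frequency bias described in the abstract: the uninformative
# high-frequency term outscores the perfectly specific low-frequency one.
print(rare_specific)     # 0.0001
print(frequent_general)  # 0.125
```

Note how the evenly distributed high-frequency term scores three orders of magnitude higher than the perfectly class-specific rare term, which is exactly the behavior the I-GI reformulations are designed to remove.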
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E94.D.855/_p
@ARTICLE{e94-d_4_855,
author={Heum PARK and Hyuk-Chul KWON},
journal={IEICE TRANSACTIONS on Information},
title={Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification},
year={2011},
volume={E94-D},
number={4},
pages={855-865},
abstract={This paper presents an improved Gini-Index algorithm to correct feature-selection bias in text classification. The Gini-Index has been used as a split measure for choosing the most appropriate splitting attribute in decision trees. Recently, an improved Gini-Index algorithm for feature selection, designed for text categorization and based on Gini-Index theory, was introduced, and it has proved better than other methods. However, we found that the Gini-Index still shows a feature-selection bias in text classification, specifically for unbalanced datasets having a huge number of features. The feature-selection bias of the Gini-Index is shown in three ways: 1) the Gini values of low-frequency features are low overall (by the purity measure), irrespective of the distribution of features among classes, 2) for high-frequency features, the Gini values are always relatively high, and 3) for specific features belonging to large classes, the Gini values are relatively lower than those belonging to small classes. Therefore, to correct that bias and improve feature selection in text classification using the Gini-Index, we propose an improved Gini-Index (I-GI) algorithm with three reformulated Gini-Index expressions. In the present study, we used global dimensionality reduction (DR) and local DR to measure the goodness of features in feature selection. In experimental results for the I-GI algorithm, we obtained unbiased feature values and eliminated many irrelevant general features while retaining many specific features. Furthermore, we could improve the overall classification performance when we used the local DR method.
The total averages of the classification performance were increased by 19.4 %, 15.9 %, 3.3 %, 2.8 % and 2.9 % (kNN) in Micro-F1, 14 %, 9.8 %, 9.2 %, 3.5 % and 4.3 % (SVM) in Micro-F1, 20 %, 16.9 %, 2.8 %, 3.6 % and 3.1 % (kNN) in Macro-F1, 16.3 %, 14 %, 7.1 %, 4.4 %, 6.3 % (SVM) in Macro-F1, compared with tf*idf, χ2, Information Gain, Odds Ratio and the existing Gini-Index methods according to each classifier.},
keywords={},
doi={10.1587/transinf.E94.D.855},
ISSN={1745-1361},
month={April},}
TY - JOUR
TI - Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification
T2 - IEICE TRANSACTIONS on Information
SP - 855
EP - 865
AU - Heum PARK
AU - Hyuk-Chul KWON
PY - 2011
DO - 10.1587/transinf.E94.D.855
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E94-D
IS - 4
JA - IEICE TRANSACTIONS on Information
Y1 - April 2011
AB - This paper presents an improved Gini-Index algorithm to correct feature-selection bias in text classification. The Gini-Index has been used as a split measure for choosing the most appropriate splitting attribute in decision trees. Recently, an improved Gini-Index algorithm for feature selection, designed for text categorization and based on Gini-Index theory, was introduced, and it has proved better than other methods. However, we found that the Gini-Index still shows a feature-selection bias in text classification, specifically for unbalanced datasets having a huge number of features. The feature-selection bias of the Gini-Index is shown in three ways: 1) the Gini values of low-frequency features are low overall (by the purity measure), irrespective of the distribution of features among classes, 2) for high-frequency features, the Gini values are always relatively high, and 3) for specific features belonging to large classes, the Gini values are relatively lower than those belonging to small classes. Therefore, to correct that bias and improve feature selection in text classification using the Gini-Index, we propose an improved Gini-Index (I-GI) algorithm with three reformulated Gini-Index expressions. In the present study, we used global dimensionality reduction (DR) and local DR to measure the goodness of features in feature selection. In experimental results for the I-GI algorithm, we obtained unbiased feature values and eliminated many irrelevant general features while retaining many specific features. Furthermore, we could improve the overall classification performance when we used the local DR method.
The total averages of the classification performance were increased by 19.4 %, 15.9 %, 3.3 %, 2.8 % and 2.9 % (kNN) in Micro-F1, 14 %, 9.8 %, 9.2 %, 3.5 % and 4.3 % (SVM) in Micro-F1, 20 %, 16.9 %, 2.8 %, 3.6 % and 3.1 % (kNN) in Macro-F1, 16.3 %, 14 %, 7.1 %, 4.4 %, 6.3 % (SVM) in Macro-F1, compared with tf*idf, χ2, Information Gain, Odds Ratio and the existing Gini-Index methods according to each classifier.
ER -