The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
Malay has no conjugations or declensions, and affixes carry important grammatical functions. The same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although very simple root words are widely used in informal conversation, formal speech and written text require precise word forms. To make sentences clear, Malay uses derivative words, and derivation is achieved mainly through affixes. In educated written Malay, a root word has approximately a hundred possible derivative forms, so the composition of Malay words can be complicated. Several types of stemming algorithms are available for text processing in English and some other languages, but they cannot overcome the difficulties of stemming Malay words. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems, and it is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) that removes inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The rule set is intended to reduce under-stemming errors, while the dictionaries are believed to reduce over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. To demonstrate the effectiveness of our Malay stemming algorithm, the text data used in the experiment were actual web pages collected from the World Wide Web. The experimental results showed that our stemmer can effectively increase the precision of the Boolean expressions extracted for text categorization.
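The design described in the abstract (affix rules combined with a root-word dictionary and a derivative-word dictionary) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the word lists, the affix lists, the lookup order, and the name `stem` are all assumptions chosen for illustration, and a real stemmer would need far larger dictionaries and a fuller rule set.

```python
# Minimal sketch of a rule-plus-dictionary stemmer.
# All data below is illustrative and assumed; it is NOT taken from the paper.

# Root-word dictionary: a candidate stem is accepted only if it appears here,
# which guards against over-stemming (e.g. stripping "-kan" from "makan").
ROOT_WORDS = {"ajar", "makan", "baca", "main", "minum"}

# Derivative-word dictionary: derived forms mapped directly to their roots,
# for cases the simple affix rules below cannot resolve.
DERIVATIVE_WORDS = {"pelajaran": "ajar", "mempelajari": "ajar"}

# Assumed affix rules (a tiny subset of Malay prefixes and suffixes).
PREFIXES = ("mem", "men", "me", "ber", "di", "pe", "ter")
SUFFIXES = ("kan", "an", "i")


def stem(word: str) -> str:
    """Return the root of `word`, or the word unchanged if no root is found."""
    word = word.lower()

    # 1. A known root word is returned as-is.
    if word in ROOT_WORDS:
        return word

    # 2. A known derivative is resolved by direct lookup.
    if word in DERIVATIVE_WORDS:
        return DERIVATIVE_WORDS[word]

    # 3. Try every prefix/suffix combination; keep candidates that are roots.
    candidates = []
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        for suffix in ("",) + SUFFIXES:
            if not prefix and not suffix:
                continue  # nothing would be stripped
            if not word.endswith(suffix):
                continue
            core = word[len(prefix): len(word) - len(suffix)]
            if core in ROOT_WORDS:
                candidates.append(core)

    # Prefer the longest validated root; otherwise leave the word unstemmed.
    return max(candidates, key=len) if candidates else word


if __name__ == "__main__":
    for w in ["makanan", "dibaca", "bermain", "pelajaran", "makan", "komputer"]:
        print(w, "->", stem(w))
```

In this sketch the affix rules propose candidate stems, which addresses under-stemming, while the root-word dictionary rejects candidates that are not real roots, which addresses over-stemming. The derivative-word dictionary covers forms such as "pelajaran", where the assumed rules would yield "lajar" rather than "ajar".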
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Michiko YASUKAWA, Hui Tian LIM, Hidetoshi YOKOO, "Stemming Malay Text and Its Application in Automatic Text Categorization" in IEICE TRANSACTIONS on Information, vol. E92-D, no. 12, pp. 2351-2359, December 2009, doi: 10.1587/transinf.E92.D.2351.
Abstract: In Malay language, there are no conjugations and declensions and affixes have important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or, a verb, depending on its position in the sentence. Although extensively simple root words are used in informal conversations, it is essential to use the precise words in formal speech or written texts. In Malay, to make sentences clear, derivative words are used. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in written language of the educated Malay. Therefore, the composition of Malay words may be complicated. Although there are several types of stemming algorithms available for text processing in English and some other languages, they cannot be used to overcome the difficulties in Malay word stemming. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The use of set of rules is aimed at reducing the occurrence of under-stemming errors, while that of the dictionaries is believed to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. For the experiment, text data used were actual web pages collected from the World Wide Web to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E92.D.2351/_p
@ARTICLE{e92-d_12_2351,
author={Michiko YASUKAWA and Hui Tian LIM and Hidetoshi YOKOO},
journal={IEICE TRANSACTIONS on Information},
title={Stemming Malay Text and Its Application in Automatic Text Categorization},
year={2009},
volume={E92-D},
number={12},
pages={2351-2359},
abstract={In Malay language, there are no conjugations and declensions and affixes have important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or, a verb, depending on its position in the sentence. Although extensively simple root words are used in informal conversations, it is essential to use the precise words in formal speech or written texts. In Malay, to make sentences clear, derivative words are used. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in written language of the educated Malay. Therefore, the composition of Malay words may be complicated. Although there are several types of stemming algorithms available for text processing in English and some other languages, they cannot be used to overcome the difficulties in Malay word stemming. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The use of set of rules is aimed at reducing the occurrence of under-stemming errors, while that of the dictionaries is believed to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. For the experiment, text data used were actual web pages collected from the World Wide Web to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.},
doi={10.1587/transinf.E92.D.2351},
ISSN={1745-1361},
month={December},}
TY - JOUR
TI - Stemming Malay Text and Its Application in Automatic Text Categorization
T2 - IEICE TRANSACTIONS on Information
SP - 2351
EP - 2359
AU - Michiko YASUKAWA
AU - Hui Tian LIM
AU - Hidetoshi YOKOO
PY - 2009
DO - 10.1587/transinf.E92.D.2351
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E92-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2009
AB - In Malay language, there are no conjugations and declensions and affixes have important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or, a verb, depending on its position in the sentence. Although extensively simple root words are used in informal conversations, it is essential to use the precise words in formal speech or written texts. In Malay, to make sentences clear, derivative words are used. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in written language of the educated Malay. Therefore, the composition of Malay words may be complicated. Although there are several types of stemming algorithms available for text processing in English and some other languages, they cannot be used to overcome the difficulties in Malay word stemming. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The use of set of rules is aimed at reducing the occurrence of under-stemming errors, while that of the dictionaries is believed to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. For the experiment, text data used were actual web pages collected from the World Wide Web to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
ER -