

MALAY LANGUAGE STEMMING TECHNIQUE BASED ON THE PORTER ALGORITHM

(TEKNIK STEMMING BAHASA MELAYU BERASASKAN ALGORITMA PORTER)

ANIZAH BINTI SAMSUDIN

This report is submitted in partial fulfilment of the requirements for the award of the degree of Bachelor of Computer Science

FACULTY OF COMPUTER SCIENCE AND INFORMATION SYSTEMS

UNIVERSITI TEKNOLOGI MALAYSIA

OCTOBER 2003


ABSTRAK (MALAY ABSTRACT)

Stemming is a process carried out to obtain the root word of a given word. It is a technique widely used for information search, particularly in the field of Information Retrieval (IR). Many algorithms have been developed for the stemming process in order to obtain the required information accurately. Most stemmers in use today are for English, given the widespread use of English, especially in internet search systems. Stemming for Malay is not yet widely used, and not many stemming techniques are generally known. Only a handful of studies on stemming for Malay could be identified. This study performs stemming on affixed Malay words, namely words with prefixes (imbuhan awalan), suffixes (imbuhan akhiran) and circumfixes (imbuhan apitan). The objectives of this project are to study Malay grammar, to study the capability of the Porter Algorithm in performing stemming, to produce rules for Malay based on the Porter Algorithm, and to test and correct the weaknesses that arise. The methodology for developing this stemming technique consists of six phases: the initial phase, the Malay grammar study phase, the Porter Algorithm study phase, the rule production phase, the coding and implementation phase, and the testing phase. The outcome of the study and the implementation in Project II is a stemming technique for Malay words based on the Porter Algorithm. The technique successfully produced the correct root word for most of the derived word forms obtained from the Kamus Dewan Bahasa dan Pustaka, and it also resolved the understemming problem.
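The abstract describes recovering a root word by stripping prefixes, suffixes and circumfixes. The following is a minimal sketch of that general idea using rule-based affix stripping with a dictionary check. It is not the Porter Algorithm itself and not the rule set produced in the thesis, which this section does not list. The affix lists, the small root dictionary and the function name `stem` are all illustrative assumptions, and the sketch ignores the spelling changes that some Malay prefixes cause in the root.

```python
# Illustrative affix lists only (assumption); the thesis's actual rules are not shown here.
PREFIXES = ["memper", "meng", "meny", "mem", "men", "me",
            "ber", "ter", "per", "pe", "di", "ke", "se"]
SUFFIXES = ["kan", "an", "i"]

# Hypothetical mini-dictionary of root words used to validate candidates.
ROOTS = {"ajar", "makan", "baca", "main", "adil", "jalan", "minum"}


def stem(word, roots=ROOTS):
    """Return the root of an affixed word, or the word unchanged if none is found."""
    word = word.lower()
    if word in roots:
        return word

    # The empty string stands for "strip nothing on this side".
    prefix_options = [""] + [p for p in PREFIXES if word.startswith(p)]
    suffix_options = [""] + [s for s in SUFFIXES if word.endswith(s)]

    # Try prefix only, suffix only, and both together (circumfix).
    for prefix in prefix_options:
        for suffix in suffix_options:
            if not prefix and not suffix:
                continue
            candidate = word[len(prefix):len(word) - len(suffix)]
            if candidate in roots:
                return candidate
    return word


print(stem("makanan"))   # makan  (suffix -an)
print(stem("berjalan"))  # jalan  (prefix ber-)
print(stem("keadilan"))  # adil   (circumfix ke-...-an)
```

Checking each candidate against a dictionary keeps the sketch from stripping letters that only look like affixes, at the cost of leaving any word whose root is missing from the dictionary unchanged.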


ABSTRACT

Stemming is a morphological process that normalizes word tokens to their essential roots. It is widely used in information retrieval systems. A stemming algorithm is a computational procedure that reduces all words with the same root to a common form. Many stemming algorithms have been developed to improve retrieval performance so that information can be obtained effectively. Stemming is most widely used for English because English is an important language in search systems. Stemming for the Malay language is not yet widely implemented, there are only a few commonly known techniques, and only a few studies of stemming techniques for Malay words have been identified. This research implements stemming for Malay words by stripping prefixes and suffixes from the word. The study has four main objectives: to study the grammar of Malay words, to study the capability of the Porter Algorithm, to produce specific rules for Malay words based on the Porter Algorithm, and to test the technique and correct the problems that surface. The methodology consists of six phases: an initial phase, a study of Malay grammar, a study of the capability of the Porter Algorithm, the production of specific rules for Malay words, the implementation of the technique in program code, and finally testing and improvement of the technique to achieve high performance. The output is a stemming technique for Malay words based on the Porter Algorithm. The performance of this Malay stemming algorithm was tested using a test collection of words extracted from the Dewan Bahasa dan Pustaka dictionary. The results show that the algorithm successfully stemmed almost all of the affixed Malay words and also overcame the understemming errors encountered in this project.
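The testing step described above, running a stemmer over a collection of dictionary words and checking the results, can be sketched as follows. This is only an illustration under stated assumptions: the test pairs are made up for the example, `suffix_only_stemmer` is a deliberately weak stand-in rather than the thesis's stemmer, and classifying an error as understemming or overstemming by comparing output length with the expected root is a simplification.

```python
def evaluate(stemmer, test_pairs):
    """Count correct results and rough error types for (word, expected_root) pairs."""
    counts = {"correct": 0, "understemmed": 0, "overstemmed": 0, "other": 0}
    for word, expected in test_pairs:
        result = stemmer(word)
        if result == expected:
            counts["correct"] += 1
        elif len(result) > len(expected):
            # Simplifying assumption: leftover characters mean too little was removed.
            counts["understemmed"] += 1
        elif len(result) < len(expected):
            # Simplifying assumption: a shorter result means too much was removed.
            counts["overstemmed"] += 1
        else:
            counts["other"] += 1
    counts["accuracy"] = counts["correct"] / len(test_pairs) if test_pairs else 0.0
    return counts


def suffix_only_stemmer(word):
    """Hypothetical weak stemmer: strips one suffix and never touches prefixes."""
    for suffix in ("kan", "an", "i"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Made-up test collection of (affixed word, expected root) pairs.
TEST_PAIRS = [
    ("makanan", "makan"),
    ("minuman", "minum"),
    ("keadilan", "adil"),
    ("diminum", "minum"),
    ("makan", "makan"),
]

report = evaluate(suffix_only_stemmer, TEST_PAIRS)
print(report)
```

With this stand-in stemmer the report shows two correct results, two understemmed words (the prefixes in "keadilan" and "diminum" are left in place) and one overstemmed word ("makan" loses a final "an" that is part of the root).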
