Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words

Ruhaila Maskat; Nurazzah Abdul Rahman

doi:10.18517/ijaseit.10.4.10237

Abstract

As the volume of data grows, so do missing values, inconsistencies, and heterogeneities, the so-called unclean aspects of data. Text analytics relies on clean data to produce reliable results, which makes pre-processing, specifically language detection and normalization, an essential phase. The difficulty with conducting text analytics on Malay social media text is that it has diverged substantially from formal Malay in spelling and construction, making it hard to process. Recent work has normalized only selected types of Malay social media text, described using simple and narrow categorizations. A formal categorization is needed to describe the different patterns of Malay social media text in a meaningful way and to allow suitable methods to be selected for handling each of them. In this paper, we propose a non-exhaustive formal categorization of Malay social media text based on its inherent nature. We refer to this text as Social Media Malay Language (SMML) to differentiate it from standard Malay. The categories are spelling variations, Malay-English mixed sentences, loan words/phrases, slang-based words, and vowel-less words. We also normalized two of the SMML categories, spelling variations and vowel-less words, using two similarity-matching techniques (n-gram Tversky index and Levenshtein). Our results show that similarity-matching techniques can detect both categories, but a more sophisticated technique is necessary to improve the precision score. Normalizing the remaining categories requires extensive further research.
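
As an illustration of the two similarity-matching techniques named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes character bigrams, symmetric Tversky weights (alpha = beta = 0.5), a length-normalized Levenshtein similarity, a tiny hypothetical lexicon, and an arbitrary threshold of 0.5; none of these values come from the paper, and the helper names are invented for the example.

```python
from typing import Callable, Iterable, Optional, Set


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Set of character n-grams; a word shorter than n is its own single gram."""
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def tversky_index(a: str, b: str, n: int = 2,
                  alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky index over n-gram sets: |X&Y| / (|X&Y| + alpha|X-Y| + beta|Y-X|)."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    common = len(x & y)
    denom = common + alpha * len(x - y) + beta * len(y - x)
    return common / denom if denom else 0.0


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Distance rescaled to [0, 1] by the longer string's length (an assumption)."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein_distance(a, b) / longest if longest else 1.0


def best_match(token: str, lexicon: Iterable[str],
               score: Callable[[str, str], float],
               threshold: float = 0.5) -> Optional[str]:
    """Return the highest-scoring lexicon word, or None if below the threshold."""
    candidates = list(lexicon)
    if not candidates:
        return None
    winner = max(candidates, key=lambda w: score(token, w))
    return winner if score(token, winner) >= threshold else None


# Hypothetical standard-Malay lexicon, for illustration only.
LEXICON = ["makan", "minum", "main", "saya"]

vowel_less = best_match("mkn", LEXICON, levenshtein_similarity)  # vowel-less word
variation = best_match("sayer", LEXICON, tversky_index)          # spelling variation
```

In this toy lexicon, "mkn" scores 0.6 against "makan" but also 0.5 against "main", so an unrelated word sits right at the threshold. That is only an illustration on invented data, but it is consistent with the abstract's remark that plain similarity matching leaves room to improve precision.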

Keywords

text analytics; social media; data pre-processing; normalization; Malay language.

Published by INSIGHT - Indonesian Society for Knowledge and Human Development

International Journal on Advanced Science, Engineering and Information Technology
