Evaluation of Average Term Occurrences Weighting Technique for Arabic Textual Information Retrieval

Belal Mustafa Abuata; Lama Ali Al Omari

doi:10.18517/ijaseit.12.6.13215

Abstract

Retrieving relevant documents is an increasingly important task, and the vector space retrieval model relies on a term weighting scheme (TWS) as its basic method for matching queries with documents. Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used term weighting scheme, and many studies have demonstrated its effectiveness in information retrieval (IR). However, TF-IDF has drawbacks, such as retrieving irrelevant documents, which sometimes reduces effectiveness. To address this, a term weighting scheme called Term Frequency with Average Term Occurrence (TF-ATO) was previously proposed and tested on English text to minimize the retrieval of irrelevant documents. In this paper, an information retrieval system is built for the Arabic language, and the Open Source Arabic Corpora (OSAC) are used for the experiments. Weights were calculated using two schemes: the traditional TF-IDF and the proposed TF-ATO. The results were then compared using evaluation measures. Across all queries, four case studies covering two approaches (stop word removal and stemming) were implemented. In the earlier English experiments, stop word removal was applied together with another discriminative approach, which calculates the centroid of the documents. Analysis of the results showed that the proposed scheme is applicable to Arabic text and that the applied approaches enhance IR effectiveness when both are implemented. Furthermore, stop word removal had a favorable effect on both schemes, which is consistent with the English experiments.
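The two weighting schemes compared in the abstract can be illustrated with a minimal sketch. The abstract does not state either formula, so the definitions here are assumptions: TF-IDF is taken as raw term frequency times log(N / df), and TF-ATO is taken as raw term frequency divided by the document's average term occurrence (total tokens divided by distinct terms). The function names `tf_idf` and `tf_ato` and the toy documents are hypothetical, and stop word removal, stemming, and the centroid-based discriminative step are not shown.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Assumed TF-IDF variant: tf * log(N / df), one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def tf_ato(docs):
    """Assumed TF-ATO variant: tf divided by the document's average term occurrence."""
    weights = []
    for doc in docs:
        counts = Counter(doc)
        if not counts:
            weights.append({})
            continue
        # Average term occurrence: total tokens / number of distinct terms.
        ato = len(doc) / len(counts)
        weights.append({term: tf / ato for term, tf in counts.items()})
    return weights


# Toy, already-tokenized documents (placeholders, not drawn from the corpus).
docs = [
    ["retrieval", "model", "retrieval", "arabic"],
    ["arabic", "corpus", "stemming"],
]

idf_weights = tf_idf(docs)
ato_weights = tf_ato(docs)
print(idf_weights[0])
print(ato_weights[0])
```

Under these assumptions TF-ATO needs only per-document statistics, whereas TF-IDF depends on collection-wide document frequencies, so a term that occurs in every document receives a TF-IDF weight of zero but keeps a nonzero TF-ATO weight.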

Keywords

Term Weighting Scheme (TWS); Term Frequency-Inverse Document Frequency (TF-IDF); Okapi BM25 model; Term Frequency-Average Term Occurrences (TF-ATO).

Published by INSIGHT - Indonesian Society for Knowledge and Human Development

International Journal on Advanced Science, Engineering and Information Technology
