Evaluation of Average Term Occurrences Weighting Technique for Arabic Textual Information Retrieval

Belal Mustafa Abuata, Lama Ali Al Omari

Abstract


Document retrieval is an important task today, and the vector space retrieval model relies on a term weighting scheme as its basic method for matching queries with documents. Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used term weighting scheme, and many studies have demonstrated its effectiveness in information retrieval. However, it has drawbacks, such as retrieving irrelevant documents, which sometimes reduce effectiveness. To address this, a term weighting scheme called Term Frequency with Average Term Occurrence (TF-ATO) was previously proposed and evaluated on English text to minimize the retrieval of irrelevant documents. In this paper, an information retrieval system is built for the Arabic language, and the Open-Source Arabic Corpora are used for the experiments. Term weights were calculated using two schemes: the traditional TF-IDF and the proposed TF-ATO. The results were then compared using evaluation measures. Using all obtained queries, four case studies combining two approaches (stop word removal and stemming) were implemented. In the English experiments, stop word removal was applied together with another discriminative approach, which calculates the centroid of the documents. Analysis of the results showed that the proposed scheme is applicable to Arabic text and that the applied approaches enhance IR effectiveness when both are implemented. Furthermore, stop word removal has a favorable effect on both schemes, which was also shown in the English experiments.
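To make the two weighting schemes concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the common unsmoothed form of TF-IDF (raw term frequency times log(N / df)) and assumes TF-ATO divides a term's frequency by the document's average term occurrence (total term occurrences divided by the number of distinct terms), which is how the scheme is usually described for English text. The function names `tf_idf` and `tf_ato` and the toy documents are hypothetical, and tokenization, stop word removal, and stemming are assumed to have been applied already.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by tf * log(N / df) (assumed unsmoothed variant)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return weights


def tf_ato(docs):
    """Weight each term by tf / ATO, where ATO is the document's average
    term occurrence (assumed definition: total occurrences / distinct terms)."""
    weights = []
    for doc in docs:
        if not doc:
            # An empty document has no terms to weight.
            weights.append({})
            continue
        tf = Counter(doc)
        ato = len(doc) / len(tf)
        weights.append({t: f / ato for t, f in tf.items()})
    return weights


# Toy pre-tokenized documents (placeholders, not data from the paper).
docs = [["a", "a", "b"], ["b", "c"]]
idf_weights = tf_idf(docs)
ato_weights = tf_ato(docs)
```

Under these assumptions TF-ATO needs no collection-wide statistics: a document's weights depend only on that document, whereas TF-IDF must recompute document frequencies whenever the collection changes.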

Keywords


Term Weighting Scheme (TWS); Term Frequency-Inverse Document Frequency (TF-IDF); Okapi BM25 model; Term Frequency-Average Term Occurrences (TF-ATO).






DOI: http://dx.doi.org/10.18517/ijaseit.12.6.13215




Published by INSIGHT - Indonesian Society for Knowledge and Human Development