Comparative Analysis of Different Data Representations for the Task of Chemical Compound Extraction

Basel Alshaikhdeeb, Kamsuriah Ahmad


Chemical compound extraction refers to the task of recognizing chemical instances such as oxygen and nitrogen. Most studies that have addressed this task used machine-learning techniques. The key challenge in using such techniques lies in employing a robust set of features, and the literature reports numerous types of features for chemical compound extraction. The dimensionality of the feature set is determined by the data representation. Some researchers have used an N-gram representation for biomedical named entity recognition, in which the most significant terms serve as features. Others have used a detailed-attribute representation, in which the features are generalized. As a result, identifying the combination of features that yields high classification accuracy is challenging. This paper applies the Wrapper Subset Selection approach to two data representations: N-gram and detailed-attributes. Since each data representation suits a specific classification algorithm, two classifiers were used: Naïve Bayes (for detailed-attributes) and Support Vector Machine (for N-gram). The results show that feature selection with detailed-attributes outperformed feature selection with the N-gram representation, achieving an f-measure of 0.722. In addition to the higher classification accuracy, the features selected using the detailed-attribute representation are more meaningful and can be applied to further datasets.
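The wrapper idea summarized above can be illustrated with a minimal sketch: candidate feature subsets are scored by training and evaluating the classifier itself, and the best-scoring subset is kept. Everything below is an assumption made for illustration and is not taken from the paper: the greedy forward search, the leave-one-out scoring, the three binary attributes (has_digit, has_hyphen, is_capitalized), and the toy data.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, features):
    """Fit a categorical Naive Bayes model restricted to the given feature indices."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature) -> counts of observed values
    for row, label in zip(rows, labels):
        for f in features:
            value_counts[(label, f)][row[f]] += 1
    return class_counts, value_counts


def predict_nb(model, row, features):
    """Return the class with the highest log-posterior for one row."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        count = class_counts[label]
        score = math.log(count / total)
        for f in features:
            # Add-one smoothing; attributes are assumed binary (two possible values).
            score += math.log((value_counts[(label, f)][row[f]] + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of Naive Bayes using only the given features."""
    hits = 0
    for i in range(len(rows)):
        model = train_nb(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:], features)
        hits += predict_nb(model, rows[i], features) == labels[i]
    return hits / len(rows)


def wrapper_forward_selection(rows, labels, n_features):
    """Greedy forward search: add the feature that most improves accuracy, stop when none does."""
    selected, best = [], loo_accuracy(rows, labels, [])
    improved = True
    while improved:
        improved = False
        for f in range(n_features):
            if f in selected:
                continue
            score = loo_accuracy(rows, labels, selected + [f])
            if score > best:
                best, candidate, improved = score, f, True
        if improved:
            selected.append(candidate)
    return selected, best


# Hypothetical toy data: (has_digit, has_hyphen, is_capitalized); label 1 = chemical token.
rows = [(1, 0, 1), (1, 1, 0), (1, 0, 0), (1, 1, 1),
        (0, 0, 1), (0, 1, 0), (0, 0, 0), (0, 1, 1)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

selected, best = wrapper_forward_selection(rows, labels, 3)
print(selected, best)
```

On this toy data only the first attribute separates the two classes, so the search keeps it and discards the other two. A real experiment would score subsets with f-measure on held-out folds rather than leave-one-out accuracy.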


Chemical Compounds Extraction; Data Representation; N-gram; Detailed-Attributes; Naïve Bayes; Support Vector Machine; Attribute Selection








Published by INSIGHT - Indonesian Society for Knowledge and Human Development