Analysing Kinship in Severe Acute Respiratory Syndrome Coronavirus 2 DNA Sequences Based on Hierarchical and K-Means Clustering Methods Using Multiple Encoding Vector

Evander Banjarnahor, Alhadi Bustamam, Titin Siswantining, Patuan Tampubolon


Based on World Health Organization data obtained in mid-April 2021, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) had infected more than 134.9 million people worldwide. The virus attacks the human respiratory system and can cause lung infection and even death. More than 2.9 million people worldwide have died from the infection. In Indonesia, more than 1.5 million people have been infected and 42.5 thousand have died. Given these figures, a kinship analysis of the coronavirus is important for helping to reduce its spread. The kinship of the virus and its spread can be identified by constructing a phylogenetic tree and by clustering. This study uses the Multiple Encoding Vector method to analyse the sequences and Euclidean distance to build the distance matrix. It then uses hierarchical clustering to determine the number of initial centroids, which the K-Means clustering method subsequently uses to analyse kinship in the SARS-CoV-2 DNA sequences. Samples of SARS-CoV-2 DNA sequences were taken from several infected countries. The simulation results indicate that the ancestors of SARS-CoV-2 came from China, and that the closest ancestors of the COVID-19 virus in Indonesia came from India. The SARS-CoV-2 DNA sequences formed nine clusters, of which the sixth has the largest number of members.
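The pipeline described above (encode each sequence as a vector, measure Euclidean distances, use hierarchical clustering to seed K-Means) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the Multiple Encoding Vector, so a plain normalised k-mer frequency vector is used as a stand-in. The four toy sequences, the function names, and the hand-picked cluster count of 2 are all hypothetical.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def kmer_vector(seq, k=2):
    """Stand-in encoding: normalised k-mer frequencies.
    ASSUMPTION: this is NOT the paper's Multiple Encoding Vector."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in kmers]


def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def hierarchical_centroids(vectors, n_clusters):
    """Agglomerative clustering with centroid linkage: repeatedly merge
    the two clusters whose centroids are closest, then return the
    centroids of the n_clusters groups that remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(mean(clusters[i]), mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return [mean(c) for c in clusters]


def kmeans(vectors, initial_centroids, max_iter=100):
    """Lloyd's K-Means started from the supplied centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels


# Hypothetical toy sequences, not real SARS-CoV-2 data.
sequences = {
    "s1": "ACGTACGTACGT",
    "s2": "ACGTACGTACGA",
    "s3": "GGGGCCCCGGGG",
    "s4": "GGGCCCCCGGGC",
}
vectors = [kmer_vector(s) for s in sequences.values()]
seeds = hierarchical_centroids(vectors, n_clusters=2)  # cluster count chosen by hand
labels = kmeans(vectors, seeds)
print(dict(zip(sequences, labels)))
```

Seeding K-Means from the hierarchical result removes its dependence on random initial centroids, which is the usual motivation for combining the two methods. In the study the number of clusters comes from the hierarchical step, whereas here it is fixed in advance for brevity.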


Sequence alignment; bioinformatics; clustering; DNA kinship; phylogenetic analysis.





Published by INSIGHT - Indonesian Society for Knowledge and Human Development