Soft Set Multivariate Distribution for Categorical Data Clustering

Iwan Tri Riyadi Yanto; Rohmat Saedudin; Sely Novita Sari; Mustafa Mat Deris; Norhalina Senan

doi:10.18517/ijaseit.11.5.15420

Abstract

Clustering partitions a large multivariate dataset into smaller groups and has been applied with remarkable success in fields including pattern recognition, segmentation, and statistics. Categorical data have no inherent distance measure, which makes them harder to cluster than numerical data. Categorical data can be assumed to follow a multinomial distribution, so a standard parametric model, latent class clustering based on a product of independent multinomial distributions, can be used. Meanwhile, the multi-valued attributes of categorical data can be decomposed into standard soft sets that together form a multi soft set. This paper proposes a clustering technique for categorical data based on soft set theory and the multinomial distribution. The data are represented as a multi soft set in which every soft set has its own probability of being a member of each cluster, and each object is assigned to the cluster for which its probability is highest. The proposed technique is evaluated using the Dunn index with respect to the number of clusters and response time. The experimental results show that the proposed technique has the lowest response time and high stability compared with baseline techniques. This study recommends a maximum number of clusters for implementation on real data.
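The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes a plain EM fit of a latent class model (a product of independent categorical distributions) and a Dunn index computed from Hamming distance. The function names `multi_soft_set`, `latent_class_cluster`, and `dunn_index`, the smoothing constant `eps`, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np

def multi_soft_set(data):
    """Decompose a categorical table into one 0/1 soft set matrix per attribute.

    Column v of matrix j marks the objects whose attribute j takes value v.
    """
    data = np.asarray(data)
    soft_sets = []
    for j in range(data.shape[1]):
        values = np.unique(data[:, j])
        soft_sets.append((data[:, j][:, None] == values[None, :]).astype(float))
    return soft_sets

def latent_class_cluster(soft_sets, k, n_iter=50, seed=0, eps=1e-9):
    """Assumed EM fit of a product-of-categoricals mixture; returns labels and posteriors."""
    rng = np.random.default_rng(seed)
    n = soft_sets[0].shape[0]
    resp = rng.dirichlet(np.ones(k), size=n)  # random initial memberships, shape (n, k)
    for _ in range(n_iter):
        # M-step: cluster weights and per-attribute value probabilities.
        weights = resp.sum(axis=0) / n
        thetas = []
        for S in soft_sets:
            counts = resp.T @ S + eps  # shape (k, n_values), smoothed
            thetas.append(counts / counts.sum(axis=1, keepdims=True))
        # E-step: log posterior of each cluster for each object.
        log_p = np.log(weights + eps)[None, :]
        for S, theta in zip(soft_sets, thetas):
            log_p = log_p + S @ np.log(theta).T
        log_p -= log_p.max(axis=1, keepdims=True)  # stabilise before exponentiating
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    # Each object goes to the cluster with the highest probability.
    return resp.argmax(axis=1), resp

def dunn_index(data, labels):
    """Dunn index: smallest between-cluster distance over largest cluster diameter.

    Hamming distance is an assumption; the abstract does not name a dissimilarity.
    """
    data, labels = np.asarray(data), np.asarray(labels)
    dist = (data[:, None, :] != data[None, :, :]).sum(axis=2)
    same = labels[:, None] == labels[None, :]
    if same.all():
        return float("nan")  # a single cluster has no between-cluster distance
    diameter = dist[same].max()
    return float("inf") if diameter == 0 else dist[~same].min() / diameter

# Hypothetical toy data: six objects described by two categorical attributes.
toy = [["a", "x"], ["a", "x"], ["a", "x"], ["b", "y"], ["b", "y"], ["b", "y"]]
labels, posteriors = latent_class_cluster(multi_soft_set(toy), k=2)
```

Running the fit for several values of `k` and comparing `dunn_index(toy, labels)` across them mirrors, in spirit, the evaluation with respect to the number of clusters that the abstract describes.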

Keywords

Clustering; categorical data; soft set; multivariate.

Published by INSIGHT - Indonesian Society for Knowledge and Human Development

International Journal on Advanced Science, Engineering and Information Technology
