Soft Set Multivariate Distribution for Categorical Data Clustering

Iwan Tri Riyadi Yanto, Rohmat Saedudin, Sely Novita Sari, Mustafa Mat Deris, Norhalina Senan


Clustering is a technique for dividing a large multivariate dataset into smaller groups. It has been applied with remarkable success in fields including pattern recognition, segmentation, and statistics. Because categorical data have no inherent distance measure, clustering them is more challenging than clustering numerical data. Categorical data can be assumed to follow a multinomial distribution, so the standard parametric model used in latent class clustering, which is based on an independent product of multinomial distributions, can be applied. Meanwhile, multi-valued attributes of categorical data can be decomposed into standard soft sets that together form a multi soft set. In this paper, a clustering technique based on soft set theory is proposed for categorical data through a multinomial distribution. The data are represented as a multi soft set in which every soft set has its own probability of being a member of a cluster, and each data object is assigned to the cluster for which its probability is highest. The proposed technique is evaluated using the Dunn index with regard to the number of clusters, and using response time. The experimental results show that the proposed technique has the lowest response time, with high stability, compared to the baseline techniques. The study also recommends a maximum number of clusters for implementation on real data.
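The assignment rule described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses a standard latent class model (an independent product of multinomial distributions fitted by expectation-maximization), and a per-attribute one-hot encoding stands in for the multi soft set decomposition. The function names `multi_soft_set` and `latent_class_cluster`, the smoothing constants, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def multi_soft_set(data):
    """Decompose each categorical attribute into a 0/1 indicator matrix
    (one column per attribute value). Hypothetical helper standing in for
    the multi soft set representation."""
    blocks = []
    for col in zip(*data):
        values = sorted(set(col))
        blocks.append(np.array([[1.0 if v == u else 0.0 for u in values]
                                for v in col]))
    return blocks

def latent_class_cluster(data, k, n_iter=50, seed=0):
    """EM for a product-of-multinomials mixture (assumed stand-in model).
    Returns hard labels and the (n, k) membership probabilities."""
    rng = np.random.default_rng(seed)
    blocks = multi_soft_set(data)
    n = len(data)
    resp = rng.dirichlet(np.ones(k), size=n)  # random initial memberships
    for _ in range(n_iter):
        # M-step: cluster weights and per-attribute value probabilities
        weights = resp.mean(axis=0)
        log_post = np.tile(np.log(weights + 1e-12), (n, 1))
        for b in blocks:
            theta = resp.T @ b + 1e-6              # shape (k, n_values)
            theta /= theta.sum(axis=1, keepdims=True)
            log_post += b @ np.log(theta).T        # shape (n, k)
        # E-step: normalize log posteriors into membership probabilities
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    # each object goes to the cluster with the highest probability
    return resp.argmax(axis=1), resp

# Toy data (made up): two groups with disjoint attribute values
data = [("a", "x", "p")] * 4 + [("b", "y", "q")] * 4
labels, resp = latent_class_cluster(data, k=2)
print(labels)
```

Each row of `resp` holds one object's membership probabilities across the clusters, and `labels` takes the cluster with the highest probability, mirroring the assignment rule stated in the abstract.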


Clustering; categorical data; soft set; multivariate.








Published by INSIGHT - Indonesian Society for Knowledge and Human Development