Grid Search CV Implementation in Random Forest Algorithm to Improve Accuracy of Breast Cancer Data

Dimas Aryo Anggoro, Nur Aini Afdallah

Abstract


Breast cancer is the most common cancer in women and the second leading cause of death worldwide. Diagnosis plays an important role in determining treatment strategies and, in turn, patient safety, which motivates the use of machine learning to predict the disease. This paper aims to determine the best parameter values for a breast cancer dataset using the Grid Search CV method and to classify the data using the Random Forest (RF) algorithm. It also compares the accuracy obtained with and without Grid Search CV. Grid Search CV is a method for determining the optimal model parameters so that the classifier can predict the test data reliably. The highest accuracy, 0.9545, is obtained by the random forest algorithm with Grid Search CV; without it, the random forest algorithm reaches an accuracy of 0.9480. For further research, we suggest applying Grid Search CV to the breast cancer dataset with other algorithms, such as Logistic Regression, XGBoost, and SVM, and applying the same algorithm to different datasets to test whether Grid Search CV improves accuracy.
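As an illustration of the grid search with cross-validation idea described in the abstract, the following is a minimal, dependency-free sketch and not the authors' implementation. Every candidate parameter setting is scored by k-fold cross-validation, and the setting with the highest mean held-out accuracy is kept. Several things here are assumptions made for the sketch: a small nearest-neighbour classifier stands in for the random forest, the data are synthetic rather than the breast cancer dataset, and the names `k_fold_indices`, `knn_predict`, `cv_accuracy`, and `grid_search_cv` are hypothetical.

```python
import itertools
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices once and split them into n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_X, train_y, x, k):
    """Stand-in classifier: majority vote among the k nearest training points."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = [train_y[i] for i in order[:k]]
    return max(sorted(set(votes)), key=votes.count)


def cv_accuracy(X, y, folds, k):
    """Mean held-out accuracy over all folds for one parameter setting."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def grid_search_cv(X, y, param_grid, n_folds=5):
    """Score every combination in param_grid and return the best one."""
    folds = k_fold_indices(len(X), n_folds)
    names = sorted(param_grid)
    results = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        results.append((cv_accuracy(X, y, folds, **params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results


# Synthetic two-class data (an assumption; not the breast cancer dataset).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(40)]
y = [0] * 40 + [1] * 40

best_params, best_score, results = grid_search_cv(X, y, {"k": [1, 3, 5, 7]})
print(best_params, round(best_score, 4))
```

In practice, a library such as scikit-learn offers `GridSearchCV` and `RandomForestClassifier`, which would replace the hand-written search and the stand-in classifier; the grid would then range over random forest parameters rather than `k`.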

Keywords


Accuracy; breast cancer; grid search cv; random forest.



DOI: http://dx.doi.org/10.18517/ijaseit.12.2.15487


Published by INSIGHT - Indonesian Society for Knowledge and Human Development