Use of DownSampling and UpSampling techniques to address data imbalance in predicting stroke-prone individuals

Authors

N. del Castillo Collazo, J. A. Contreras Arvizu, A. J. Durán Ortega

DOI:

https://doi.org/10.36825/RITI.12.25.007

Keywords:

Data Imbalance, DownSampling, UpSampling, Random Forest, Machine Learning

Abstract

The DownSampling and UpSampling data balancing techniques are applied to a data set of individuals prone to having a stroke. The purpose of this work is to demonstrate the importance of applying these techniques to imbalanced data, to compare the two techniques, and to analyze the behavior of the measures calculated from the confusion matrix once the prediction model is built. The data set is composed of 4981 records, of which 4733 belong to the class of those who have not suffered a stroke and 248 to the class of those who have. For this data set, the best technique for treating the imbalance was found to be UpSampling with its largest number of replicas. When evaluating the model, it is important to rely not only on its Accuracy but also on the other measures derived from the confusion matrix, in order to achieve a better analysis of the results obtained.
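
As a rough illustration of the two balancing techniques and of why Accuracy alone can mislead, the following Python sketch uses only the total record count and the minority class size stated in the abstract. It is not the authors' implementation: the function names (`downsample`, `upsample`, `confusion_measures`), the integer record ids, and the "always predict no stroke" baseline are assumptions made for this example, and no prediction model is trained.

```python
import random


def downsample(majority, minority, seed=0):
    """Randomly drop majority records until both classes have the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def upsample(majority, minority, seed=0):
    """Randomly repeat minority records (with replacement) up to the majority size."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority), list(minority) + extra


def confusion_measures(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Counts taken from the abstract; the record ids are placeholders (assumption).
TOTAL_RECORDS = 4981
STROKE_RECORDS = 248
no_stroke = list(range(TOTAL_RECORDS - STROKE_RECORDS))
stroke = list(range(TOTAL_RECORDS - STROKE_RECORDS, TOTAL_RECORDS))

down_major, down_minor = downsample(no_stroke, stroke)
up_major, up_minor = upsample(no_stroke, stroke)
print(len(down_major), len(down_minor))  # both classes shrink to the minority size
print(len(up_major), len(up_minor))      # both classes grow to the majority size

# Hypothetical baseline that labels every record "no stroke" on the raw data:
# every stroke case becomes a false negative.
baseline = confusion_measures(tp=0, fp=0, tn=len(no_stroke), fn=len(stroke))
print(baseline)
```

The baseline reaches a high Accuracy while its sensitivity is zero, which is the kind of result that motivates reading the other confusion-matrix measures alongside Accuracy.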

Published

2024-06-11

How to Cite

del Castillo Collazo, N., Contreras Arvizu, J. A., & Durán Ortega, A. J. (2024). Use of DownSampling and UpSampling techniques to address data imbalance in predicting stroke-prone individuals. Revista De Investigación En Tecnologías De La Información, 12(25), 66–78. https://doi.org/10.36825/RITI.12.25.007

Issue

Vol. 12 No. 25 (2024)

Section

Articles