Uso de las técnicas DownSampling y UpSampling para abordar el desequilibrio de datos en la predicción de personas propensas a sufrir accidentes cerebrovasculares

Nelson del Castillo Collazo; Juan Antonio Contreras Arvizu; Adalberto Joel Durán Ortega

doi:10.36825/RITI.12.25.007

Authors

Nelson del Castillo Collazo IIMAS, UNAM https://orcid.org/0000-0002-4187-5511
Juan Antonio Contreras Arvizu IIMAS, UNAM https://orcid.org/0000-0003-1375-5499
Adalberto Joel Durán Ortega IIMAS, UNAM https://orcid.org/0000-0003-4272-7294

DOI:

https://doi.org/10.36825/RITI.12.25.007

Keywords:

Data Imbalance, DownSampling, UpSampling, Random Forest, Machine Learning

Abstract

The DownSampling and UpSampling data-balancing techniques are applied to a data set of individuals prone to having a stroke. The purpose of this work is to demonstrate the importance of applying these techniques to imbalanced data, by comparing the two techniques and analyzing how the measures derived from the confusion matrix behave when the prediction model is built. The data set consists of 4981 records, of which 4733 belong to the class of those who have not suffered a stroke and 248 to the class of those who have. For this data set, the best technique for treating the imbalance was found to be UpSampling with the largest of its replicas. When the model is evaluated, it is important to rely not only on its Accuracy but also on the other measures derived from the confusion matrix, in order to analyze the results more thoroughly.
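The two balancing strategies and the confusion-matrix measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `downsample`, `upsample`, and `confusion_metrics` are hypothetical, the labels are synthetic, and the class sizes are taken from the abstract's total of 4981 records and 248 stroke cases (the remaining 4981 − 248 records are treated as the no-stroke class). The sketch balances the classes to equal size and does not attempt to reproduce the replicas mentioned in the abstract.

```python
import random
from collections import Counter


def _indices_by_class(labels):
    """Group record indices by class label."""
    groups = {}
    for i, y in enumerate(labels):
        groups.setdefault(y, []).append(i)
    return groups


def downsample(labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _indices_by_class(labels)
    n_min = min(len(idx) for idx in groups.values())
    kept = []
    for idx in groups.values():
        kept.extend(rng.sample(idx, n_min))  # sampling without replacement
    return sorted(kept)


def upsample(labels, seed=0):
    """Keep every record and add random duplicates up to the largest class size."""
    rng = random.Random(seed)
    groups = _indices_by_class(labels)
    n_max = max(len(idx) for idx in groups.values())
    kept = []
    for idx in groups.values():
        kept.extend(idx)
        kept.extend(rng.choices(idx, k=n_max - len(idx)))  # with replacement
    return sorted(kept)


def confusion_metrics(y_true, y_pred, positive=1):
    """Compute Accuracy, sensitivity, specificity, and precision from a 2x2 confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Synthetic labels with the class sizes implied by the abstract (assumption).
n_total, n_stroke = 4981, 248
labels = [1] * n_stroke + [0] * (n_total - n_stroke)

down_counts = Counter(labels[i] for i in downsample(labels))
up_counts = Counter(labels[i] for i in upsample(labels))

# A trivial model that always predicts "no stroke" scores a high Accuracy
# while detecting none of the stroke cases.
baseline = confusion_metrics(labels, [0] * n_total)
```

In this sketch the always-negative baseline reaches an Accuracy of about 0.95 with a sensitivity of 0, which is the reason for looking beyond Accuracy on imbalanced data. In practice, resampling is usually applied to the training split only, so that the evaluation data keep the original class proportions.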

Use of DownSampling and UpSampling techniques to address data imbalance in predicting stroke-prone individuals
