Improving effort estimation in software projects using oversampling and machine learning methods

Authors

Bedolla Martínez, B., Cruz-Barbosa, R., García Pacheco, I. A.

DOI:

https://doi.org/10.36825/RITI.13.31.008

Keywords:

Software Effort Estimation, Software Projects, Machine Learning, Prediction, Oversampling, Regression

Abstract

Effort estimation predicts the time required to develop a software product or the resources needed to complete it within an established timeframe. A current alternative for producing such estimates is the use of machine learning methods. However, publicly available data sets generally contain few samples, which limits the effectiveness of these methods; it is therefore necessary to increase the number of samples using oversampling methods. This paper presents the use of ensemble methods combined with oversampling and undersampling strategies to analyze their impact on the performance of regressors applied to small and medium-sized data sets. Their effectiveness in improving effort estimation in software projects is evaluated using measures such as MMRE, MAE, RMSE, and Pred. The results, mainly for MMRE and Pred, show that applying these strategies reduces prediction errors. Consequently, using an appropriate ensemble model together with oversampling and undersampling strategies improves effort prediction, especially on small data sets such as COCOMO, Maxwell, and Desharnais, which have highly imbalanced sample distributions.
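
The following minimal Python sketch illustrates the evaluation described above; it is not the paper's pipeline. The synthetic data, the choice of RandomForestRegressor, and the helper names mmre and pred are illustrative assumptions, while the metrics follow the standard definitions MRE_i = |y_i - ŷ_i| / y_i, MMRE = mean(MRE), and Pred(k) = fraction of samples with MRE ≤ k/100.

# Minimal sketch (assumptions noted above): fit an ensemble regressor on a
# synthetic effort-estimation-style data set and score it with MMRE and Pred(25).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def mmre(y_true, y_pred):
    """Mean Magnitude of Relative Error."""
    mre = np.abs(y_true - y_pred) / y_true
    return mre.mean()

def pred(y_true, y_pred, k=25):
    """Pred(k): proportion of estimates whose MRE is at most k percent."""
    mre = np.abs(y_true - y_pred) / y_true
    return np.mean(mre <= k / 100)

# Synthetic stand-in for a small effort data set (features X, positive effort y).
rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(80, 5))
y = 2.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 5, size=80) + 50

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An oversampling strategy for regression (e.g., SMOTER or SMOGN) would be
# applied to the training split here before fitting; it is omitted for brevity.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
y_hat = model.predict(X_test)

print(f"MMRE: {mmre(y_test, y_hat):.3f}")
print(f"Pred(25): {pred(y_test, y_hat):.3f}")

Lower MMRE and higher Pred(25) indicate better effort estimates, which is how the strategies in the abstract are compared.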

References

Durgesh, D. V. S., Saket, M. V. S., Reddy, B. R. (2023). Improving software effort estimation with heterogeneous stacked ensemble using SMOTER over ELM and SVR base learners. In R. Morusupalli, T. S. Dandibhotla, V. V. Atluri, D. Windridge, P. Lingras, V. R. Komati (Eds.), Multi-disciplinary trends in artificial intelligence (pp. 442–448). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-36402-0_41

Sunda, N., Sinha, R. R. (2023). Optimizing effort estimation in agile software development: Traditional vs. advanced ML methods. IEEE International Conference on Communication, Security and Artificial Intelligence (ICCSAI). Greater Noida, India. https://doi.org/10.1109/ICCSAI59793.2023.10421235

Belhaouari, S. B., Islam, A., Kassoul, K., Al-Fuqaha, A., Bouzerdoum, A. (2024). Oversampling techniques for imbalanced data in regression. Expert Systems with Applications, 252, 1–19. https://doi.org/10.1016/j.eswa.2024.124118

Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953

Avelino, J. G., Cavalcanti, G. D. C., Cruz, R. M. O. (2024). Resampling strategies for imbalanced regression: A survey and empirical analysis. Artificial Intelligence Review, 57, 82–124. https://doi.org/10.1007/s10462-024-10724-3

Moniz, N., Ribeiro, R., Cerqueira, V., Chawla, N. (2018). SMOTEBoost for regression: Improving the prediction of extreme values. IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). Turin, Italy. https://doi.org/10.1109/DSAA.2018.00025

Torgo, L., Ribeiro, R. P., Pfahringer, B., Branco, P. (2013). SMOTE for regression. In L. Correia, L. P. Reis, J. Cascalho (Eds.), Progress in Artificial Intelligence (pp. 378–389). Springer. https://doi.org/10.1007/978-3-642-40669-0_33

Jawa, M., Meena, S. (2022). Software effort estimation using synthetic minority over-sampling technique for regression (SMOTER). IEEE 3rd International Conference for Emerging Technology (INCET). Belgaum, India. https://doi.org/10.1109/INCET54531.2022.9824043

Yun, F. H. (2025). China: Effort estimation dataset. Zenodo. https://zenodo.org/records/268446

Li, Y. (2025). Effort estimation: Maxwell. Zenodo. https://zenodo.org/records/268461

Kaggle. (2025). Effort-estimation-on-cocomo-dataset. https://kaggle.com/code/vanlocbk1996/effort-estimation-on-cocomo-dataset

Esteves, A. (2025). Software effort estimation. https://github.com/yy2111/Software-Effort-Estimation/blob/master/Datasets/02.desharnais.csv

Bhattacharyya, A., Srijith, K., Behera, R. P., Dasgupta, A., Chakraborty, R. S. (2024). A study on effects of synthetic data for predicting the remaining useful life of aluminium electrolytic capacitors using bagging-based ensemble learning. International Conference on Advances in Data-driven Computing and Intelligent Systems (ADCIS). Goa, India. https://doi.org/10.1007/978-981-99-9518-9_40

Qi, L., Zhihao, L., Jianxiao, Z. I. (2024). A SMOGN-based MPSO-BP model to predict the height of a hydraulically conductive fracture zone. Coal Geology & Exploration, 52 (11), 72–85. https://cge.researchcommons.org/journal/vol52/iss11/7/

Rad, M., Rafiei, A., Grunwell, J., Kamaleswaran, R. (2025). Tackling the small imbalanced horizontal dataset regressions by stability selection and SMOGN: A case study of ventilation-free days prediction in the pediatric intensive care unit and the importance of PRISM. International Journal of Medical Informatics, 196. https://doi.org/10.1016/j.ijmedinf.2025.105809

Branco, P., Torgo, L., Ribeiro, R. P. (2017). SMOGN: A pre-processing approach for imbalanced regression. First International Workshop on Learning with Imbalanced Domains: Theory and Applications (LIDTA). Skopje, Macedonia. https://proceedings.mlr.press/v74/branco17a/branco17a.pdf

Rahman, M., Sarwar, H., Kader, M. D. A., Gonçalves, T., Tin, T. T. (2024). Review and empirical analysis of machine learning-based software effort estimation. IEEE Access, 12, 85661–85680. https://doi.org/10.1109/ACCESS.2024.3404879

Abid, M., Bukhari, S., Saqlain, M. (2025). Enhancing software effort estimation in healthcare informatics: A comparative analysis of machine learning models with correlation-based feature selection. Sustainable Machine Intelligence, 10, 50–66. https://doi.org/10.61356/SMIJ.2025.10451

Mienye, I. D., Sun, Y. (2022). A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access, 10, 99129–99149. https://doi.org/10.1109/ACCESS.2022.3207287

Varshini, A. G. P., Kumari, K. A., Janani, D., Soundariya, S. (2021). Comparative analysis of machine learning and deep learning algorithms for software effort estimation. Journal of Physics: Conference Series, 1767, 1–11. https://doi.org/10.1088/1742-6596/1767/1/012019

Şengüneş, B., Öztürk, N. (2023). An artificial neural network model for project effort estimation. Systems, 11 (2), 1–22. https://doi.org/10.3390/systems11020091

Zakrani, A., Hain, M., Idri, A. (2019). Improving software development effort estimating using support vector regression and feature selection. IAES International Journal of Artificial Intelligence, 8 (4), 399–410. https://doi.org/10.11591/ijai.v8.i4.pp399-410

Rahman, M., Roy, P. P., Ali, M., Goncalves, T., Sarwar, H. (2023). Software effort estimation using machine learning technique. International Journal of Advanced Computer Science and Applications, 14 (4), 822–827. https://doi.org/10.14569/IJACSA.2023.0140491

Published

2025-10-26

How to Cite

Bedolla Martínez, B., Cruz-Barbosa, R., & García Pacheco, I. A. (2025). Improving effort estimation in software projects using oversampling and machine learning methods. Revista De Investigación En Tecnologías De La Información, 13(31 Especial), 80–93. https://doi.org/10.36825/RITI.13.31.008