Performance comparison of LLMs in the Generation of UML class diagrams using a RAG system
DOI: https://doi.org/10.36825/RITI.13.31.007

Keywords: Software Engineering (SE), Requirements Engineering (RE), Large Language Models (LLMs), RAG Systems, UML Diagrams

Abstract
This study presents a comparative evaluation of six LLMs (Gemini-2.5-Pro, Claude-Sonnet-4, GPT-4, DeepSeek-R1, Llama3.2, and Mistral) within a RAG system for generating analysis-level UML class diagrams from user stories written in natural language. PlantUML code was used both to generate the diagrams and to evaluate them, enabling a comparison between generated and reference diagrams using the ROUGE-L metric, with a focus on average recall. The results showed that Gemini-2.5-Pro, Claude-Sonnet-4, and GPT-4 performed best, with Claude-Sonnet-4 standing out by obtaining the highest average scores on most user stories. In contrast, DeepSeek-R1, Llama3.2, and Mistral presented difficulties, including the generation of invalid PlantUML code, which prevented automated evaluation in some cases. Incorporating the RAG system provided an initial foundation for exploring improvements in response quality, while also revealing limitations in the quality and relevance of the retrieved context. Finally, opportunities for improvement were identified, such as prompt refinement and enhancement of the context used by the RAG system.
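The evaluation step described above can be sketched in code. The study's exact tokenization and ROUGE-L implementation are not specified here, so the following is an illustrative sketch: ROUGE-L recall computed over whitespace tokens via longest-common-subsequence, applied to hypothetical generated and reference PlantUML snippets (the class names and members are invented for illustration).

```python
def lcs_length(a, b):
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference: str, candidate: str) -> float:
    # ROUGE-L recall = LCS(reference, candidate) / |reference tokens|.
    # Whitespace tokenization is an assumption; the study may tokenize differently.
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens:
        return 0.0
    return lcs_length(ref_tokens, cand_tokens) / len(ref_tokens)

# Hypothetical PlantUML fragments (flattened to one line for tokenization).
reference_uml = "class Order { +id : int +total : float } Order --> Customer"
generated_uml = "class Order { +id : int } Order --> Customer"

score = rouge_l_recall(reference_uml, generated_uml)
```

Here the generated diagram omits one attribute, so every generated token still appears in order within the reference and recall is 10/13, roughly 0.77; a model emitting invalid PlantUML would simply fail parsing before this step, as the abstract notes happened for some models.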
License
Copyright (c) 2025 Revista de Investigación en Tecnologías de la Información

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This journal provides open access to its content, based on the principle that offering the public free access to research supports a greater global exchange of knowledge.
The text published in the Revista de Investigación en Tecnologías de la Información (RITI) is distributed under the Creative Commons license (CC BY-NC), which allows third parties to use the published material provided they cite the authors and RITI, but does not permit use of the material for commercial purposes.
