Performance comparison of LLMs in the Generation of UML class diagrams using a RAG system

Authors

Martínez Sixto, L., Fernández y Fernández, C. A., Millán Hernández, C. E.

DOI:

https://doi.org/10.36825/RITI.13.31.007

Keywords:

Software Engineering (SE), Requirements Engineering (RE), Large Language Models (LLMs), RAG Systems, UML Diagrams

Abstract

This study presents a comparative evaluation of six LLMs (Gemini-2.5-Pro, Claude-Sonnet-4, GPT-4, DeepSeek-R1, Llama3.2, and Mistral) within a RAG system for generating analysis-level UML class diagrams from user stories written in natural language. The diagrams were generated and evaluated as PlantUML code, allowing each generated diagram to be compared against a reference diagram using the ROUGE-L metric, with average recall as the main criterion. The results show that Gemini-2.5-Pro, Claude-Sonnet-4, and GPT-4 achieved the best performance, with Claude-Sonnet-4 standing out by obtaining the highest average scores on most user stories. In contrast, DeepSeek-R1, Llama3.2, and Mistral exhibited difficulties, including the generation of invalid PlantUML code, which limited automated evaluation in some cases. The RAG system provided an initial foundation for exploring improvements in response quality, although the results suggest limitations in the quality and relevance of the retrieved context. Finally, opportunities for improvement were identified, such as refining the prompts and enhancing the context used by the RAG system.
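The evaluation idea described above can be sketched in a few lines: treat the reference and generated PlantUML sources as token sequences and compute ROUGE-L recall, i.e., the length of their longest common subsequence divided by the number of reference tokens. This is a minimal stdlib-only illustration of the metric, not the authors' actual pipeline; the example diagrams are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS(reference, candidate) / number of reference tokens."""
    ref, cand = reference.split(), candidate.split()
    return lcs_length(ref, cand) / len(ref) if ref else 0.0

# Hypothetical reference diagram vs. an LLM-generated one (PlantUML fragments as text).
reference = "class User { +name : String +login() }"
candidate = "class User { +username : String +login() }"
print(round(rouge_l_recall(reference, candidate), 3))  # 7 of 8 reference tokens recalled
```

In practice the compared strings would be full PlantUML files, and a candidate that fails to parse as valid PlantUML (as reported for some models) would be flagged before scoring.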

References

Kotti, Z., Galanopoulou, R., Spinellis, D. (2023). Machine learning for software engineering: A tertiary study. ACM Computing Surveys, 55 (12), 1-39. https://doi.org/10.1145/3572905

Hoffmann, M., Mendez, D., Fagerholm, F., Luckhardt, A. (2022). The human side of software engineering teams: an investigation of contemporary challenges. IEEE Transactions on Software Engineering, 49 (1), 211-225. https://doi.org/10.1109/TSE.2022.3148539

Ferrario, M. A., Winter, E. (2022). Applying human values theory to software engineering practice: Lessons and implications. IEEE Transactions on Software Engineering, 49 (3), 973-990. https://doi.org/10.1109/TSE.2022.3170087

Altameem, E. A. (2015). Impact of agile methodology on software development. Computer and Information Science, 8 (2), 9-14. https://doi.org/10.5539/cis.v8n2p9

Gupta, A., Poels, G., Bera, P. (2022). Using conceptual models in agile software development: A possible solution to requirements engineering challenges in agile projects. IEEE Access, 10, 119745-119766. https://doi.org/10.1109/ACCESS.2022.3221428

Jin, H., Huang, L., Cai, H., Yan, J., Li, B., Chen, H. (2024). From llms to llm-based agents for software engineering: A survey of current, challenges and future. arXiv preprint. https://doi.org/10.48550/arXiv.2408.02479

Feng, S., Chen, C. (2024). Prompting is all you need: Automated android bug replay with large language models. 46th IEEE/ACM International Conference on Software Engineering. Lisbon, Portugal. https://doi.org/10.1145/3597503.3608137

Wang, C., Pastore, F., Goknil, A., Briand, L. C. (2020). Automatic generation of acceptance test cases from use case specifications: an nlp-based approach. IEEE Transactions on Software Engineering, 48 (2), 585-616. https://doi.org/10.1109/TSE.2020.2998503

Liu, Z., Qian, P., Wang, X., Zhuang, Y., Qiu, L., Wang, X. (2021). Combining graph neural networks with expert knowledge for smart contract vulnerability detection. IEEE Transactions on Knowledge and Data Engineering, 35 (2), 1296-1310. https://doi.org/10.1109/TKDE.2021.3095196

van Remmen, J. S., Horber, D., Lungu, A., Chang, F., van Putten, S., Goetz, S., Wartzack, S. (2023). Natural language processing in requirements engineering and its challenges for requirements modelling in the engineering design domain. International Conference on Engineering Design (ICED). Bordeaux, France. https://doi.org/10.1017/pds.2023.277

Conrardy, A., Cabot, J. (2024). From image to uml: First results of image based uml diagram generation using llms. arXiv preprint. https://doi.org/10.48550/arXiv.2404.11376

Zheng, Z., Ning, K., Zhong, Q., Chen, J., Chen, W., Guo, L., Wang, W., Wang, Y. (2025). Towards an understanding of large language models in software engineering tasks. Empirical Software Engineering, 30 (2). https://doi.org/10.1007/s10664-024-10602-0

Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., Wang, H. (2024). Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 33 (8), 1-79. https://doi.org/10.1145/3695988

Kalyan, K. S. (2024). A survey of GPT-3 family large language models including ChatGPT and GPT-4. Natural Language Processing Journal, 6, 1-48. https://doi.org/10.1016/j.nlp.2023.100048

Yao, Y., Duan, J., Xu, K., Cai, Y., Sun, Z., Zhang, Y. (2024). A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, 4 (2), 1-21. https://doi.org/10.1016/j.hcc.2024.100211

Sonoda, Y., Kurokawa, R., Nakamura, Y., Kanzawa, J., Kurokawa, M., Ohizumi, Y., Gonoi, W., Abe, O. (2024). Diagnostic Performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” Cases. Japanese Journal of Radiology, 42, 1231-1235. https://doi.org/10.1007/s11604-024-01619-y

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., … Wen, J.-R. (2023). A Survey of Large Language Models. arXiv preprint. http://arxiv.org/abs/2303.18223

Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., Gao, J. (2024). Large Language Models: A Survey. arXiv preprint. http://arxiv.org/abs/2402.06196

Islam, R., Ahmed, I. (2024). Gemini-the most powerful LLM: Myth or Truth. IEEE 5th Information Communication Technologies Conference (ICTC). Nanjing, China. https://doi.org/10.1109/ICTC61510.2024.10602253

Lewandowska-Tomaszczyk, B., Liebeskind, C. (2024). Opinion events and stance types: advances in LLM performance with ChatGPT and Gemini. Lodz Papers in Pragmatics, 20 (2), 413-432. https://doi.org/10.1515/lpp-2024-0039

Kim, S., Huang, D., Xian, Y., Hilliges, O., Van Gool, L., Wang, X. (2024). Palm: Predicting actions through language models. European Conference on Computer Vision. Milan, Italy. https://doi.org/10.1007/978-3-031-73007-8_9

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., ... Piao, Y. (2024). Deepseek-v3 technical report. arXiv preprint. https://doi.org/10.48550/arXiv.2412.19437

Tsai, H. C., Huang, Y. F., Kuo, C. W. (2024). Comparative analysis of automatic literature review using mistral large language model and human reviewers. Research Square. https://doi.org/10.21203/rs.3.rs-4022248/v1

Yang, A., Yu, B., Li, C., Liu, D., Huang, F., Huang, H., ... Zhang, Z. (2025). Qwen2.5-1M technical report. arXiv preprint. https://doi.org/10.48550/arXiv.2501.15383

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. 34th International Conference on Neural Information Processing Systems. Vancouver, BC, Canada. https://dl.acm.org/doi/abs/10.5555/3495724.3495883

Martino, A., Iannelli, M., Truong, C. (2023). Knowledge injection to counter large language model (LLM) hallucination. European Semantic Web Conference. Hersonissos, Greece. https://doi.org/10.1007/978-3-031-43458-7_34

Di Rocco, J., Di Ruscio, D., Di Sipio, C., Nguyen, P. T., Rubei, R. (2025). On the use of large language models in model-driven engineering. Software and Systems Modeling, 24, 923-948. https://doi.org/10.1007/s10270-025-01263-8

Li, C., Liu, Z., Xiao, S., Shao, Y. (2023). Making Large Language Models A Better Foundation For Dense Retrieval. arXiv preprint. http://arxiv.org/abs/2312.15503

Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., Liu, Z. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv preprint. http://arxiv.org/abs/2402.03216

Meng, R., Liu, Y., Joty, S. R., Xiong, C., Zhou, Y., Yavuz, S. (2024). SFR-Embedding-Mistral: Enhance Text Retrieval with Transfer Learning. https://www.salesforce.com/blog/sfr-embedding/

De Bari, D., Garaccione, G., Coppola, R., Torchiano, M., Ardito, L. (2024). Evaluating Large Language Models in Exercises of UML Class Diagram Modeling. 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. Barcelona, Spain. https://doi.org/10.1145/3674805.3690741

Ferrari, A., Abualhaija, S., Arora, C. (2024). Model generation with LLMs: From requirements to UML sequence diagrams. IEEE 32nd International Requirements Engineering Conference Workshops (REW). Reykjavik, Iceland. https://doi.org/10.1109/REW61692.2024.00044

PlantUML. (2025). PlantUML: Class diagrams. https://real-world-plantuml.com/?type=class

Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out (ACL Workshop). Barcelona, Spain. https://aclanthology.org/W04-1013/

Lin, C. Y., Och, F. J. (2004). Looking for a few good metrics: ROUGE and its evaluation. https://research.nii.ac.jp/ntcir/ntcir-ws4/NTCIR4-WN/OPEN/OPENSUB_Chin-Yew_Lin.pdf

Lin, S., Wang, B., Huang, Z., Li, C. (2024). A Study on a Precision Geriatric Medical Knowledge Q&A Model Based on Retrieval-Augmented Generation. Preprints. https://www.preprints.org/manuscript/202412.2424/v1

Liu, Y., Huang, L., Li, S., Chen, S., Zhou, H., Meng, F., ... Sun, X. (2023). Recall: A benchmark for llms robustness against external counterfactual knowledge. arXiv preprint. https://doi.org/10.48550/arXiv.2311.08147

Rodríguez Limón, C. (2025). Aumento de Datos Basado en Recursos Lingüísticos para RAG sobre Textos Legales en Español [Data augmentation based on linguistic resources for RAG on Spanish legal texts; undergraduate thesis]. Universidad Politécnica de Madrid, Spain. https://oa.upm.es/87925/1/TFG_CARLOS_RODRIGUEZ_LIMON.pdf

Published

2025-10-26

How to Cite

Martínez Sixto, L., Fernández y Fernández, C. A., & Millán Hernández, C. E. (2025). Performance comparison of LLMs in the Generation of UML class diagrams using a RAG system. Revista De Investigación En Tecnologías De La Información, 13(31 Especial), 64–79. https://doi.org/10.36825/RITI.13.31.007