Comparison of LLM performance in generating UML class diagrams using a RAG system
DOI: https://doi.org/10.36825/RITI.13.31.007
Keywords: Software Engineering (SE), Requirements Engineering (RE), Large Language Models (LLMs), RAG Systems, UML Diagrams
Abstract
This study presents a comparative evaluation of the performance of six LLMs (Gemini-2.5-Pro, Claude-Sonnet-4, GPT-4, DeepSeek-R1, Llama3.2, and Mistral) within a RAG system for generating analysis-level UML class diagrams from user stories written in natural language. PlantUML code was used to generate and evaluate the diagrams, which made it possible to compare the generated diagrams against reference diagrams using the ROUGE-L metric, focused on average recall. The results showed that Gemini-2.5-Pro, Claude-Sonnet-4, and GPT-4 performed best, with Claude-Sonnet-4 standing out for achieving the highest average scores on most user stories. In contrast, DeepSeek-R1, Llama3.2, and Mistral exhibited difficulties, including the generation of invalid PlantUML code, which limited automatic evaluation in some cases. Incorporating the RAG system provided an initial basis for exploring improvements in response quality, although the results suggest limitations in the quality and relevance of the retrieved context. Finally, opportunities for improvement were identified, such as refining the prompt and improving the context used by the RAG system.
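The evaluation described above treats each diagram as PlantUML source text and scores a generated diagram against its reference with ROUGE-L recall. The following Python sketch is an illustrative assumption rather than the authors' implementation: it computes token-level ROUGE-L recall between two hypothetical PlantUML class diagrams using a longest-common-subsequence calculation, with example classes and relationships invented for the demonstration.

```python
# Minimal sketch (not the study's code): ROUGE-L recall between a reference
# PlantUML class diagram and an LLM-generated one, using token-level LCS.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length divided by the number of reference tokens."""
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    return lcs_length(ref_tokens, cand_tokens) / len(ref_tokens) if ref_tokens else 0.0

# Hypothetical diagrams, for illustration only.
reference_uml = """@startuml
class Cliente {
  +nombre: String
  +realizarPedido()
}
Cliente "1" --> "*" Pedido
@enduml"""

generated_uml = """@startuml
class Cliente {
  +nombre: String
}
Cliente --> Pedido
@enduml"""

print(f"ROUGE-L recall: {rouge_l_recall(reference_uml, generated_uml):.3f}")
```

In practice, a library such as the rouge-score package can replace the hand-rolled LCS, and per-diagram scores would be averaged across the set of user stories to obtain an average recall of the kind reported in the study.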
License
Copyright 2025 Revista de Investigación en Tecnologías de la Información

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International license.
This journal provides open access to its content, based on the principle that offering the public free access to research supports a greater global exchange of knowledge.
The text published in the Revista de Investigación en Tecnologías de la Información (RITI) is distributed under the Creative Commons license (CC BY-NC), which allows third parties to use the published material provided they cite the authors of the work and RITI, but not for commercial purposes.
