Comparing the Performance of LLMs in Generating UML Class Diagrams Using a RAG System

Authors

Martínez Sixto, L., Fernández y Fernández, C. A., Millán Hernández, C. E.

DOI:

https://doi.org/10.36825/RITI.13.31.007

Keywords:

Software Engineering (SE), Requirements Engineering (RE), Large Language Models (LLMs), RAG Systems, UML Diagrams

Abstract

This study presents a comparative evaluation of six LLMs (Gemini-2.5-Pro, Claude-Sonnet-4, GPT-4, DeepSeek-R1, Llama3.2, and Mistral) within a RAG system for generating analysis-level UML class diagrams from user stories written in natural language. Diagrams were generated and evaluated as PlantUML code, which allowed the generated diagrams to be compared against reference diagrams using the ROUGE-L metric, focusing on average recall. The results showed that Gemini-2.5-Pro, Claude-Sonnet-4, and GPT-4 performed best, with Claude-Sonnet-4 standing out for achieving the highest average scores on most user stories. In contrast, DeepSeek-R1, Llama3.2, and Mistral showed difficulties, in some cases producing invalid PlantUML code, which prevented automatic evaluation. Incorporating the RAG system provided an initial baseline for exploring improvements in response quality, and the results suggest limitations in the quality and relevance of the retrieved context. Finally, opportunities for improvement were identified, such as refining the prompt and enriching the context used by the RAG system.
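The abstract does not include the study's evaluation code; as a rough illustration of the scoring step it describes, the sketch below computes ROUGE-L recall (longest-common-subsequence length divided by the reference token count) between two hypothetical PlantUML snippets. The whitespace tokenization and the example diagrams are assumptions for illustration, not the paper's actual preprocessing or data.

```python
# Minimal ROUGE-L recall sketch: score a generated PlantUML diagram
# against a reference diagram. Tokenization is a plain whitespace split
# (an assumption; the paper's exact preprocessing is not specified).

def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference: str, candidate: str) -> float:
    """ROUGE-L recall = LCS length / number of reference tokens."""
    ref, cand = reference.split(), candidate.split()
    return lcs_length(ref, cand) / len(ref) if ref else 0.0

# Hypothetical reference and generated diagrams (not from the paper).
reference = """@startuml
class User {
  +name : String
}
User --> Order
@enduml"""

candidate = """@startuml
class User {
  +name : String
  +email : String
}
@enduml"""

score = rouge_l_recall(reference, candidate)
print(round(score, 3))  # prints 0.75
```

A recall-oriented score like this rewards covering the reference diagram's elements (classes, attributes, relations) and does not penalize extra generated content, which matches the abstract's focus on average recall; it also explains why syntactically invalid PlantUML still breaks the pipeline, since the generated text must first be rendered and compared as code.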


Published

2025-10-26

How to Cite

Martínez Sixto, L., Fernández y Fernández, C. A., & Millán Hernández, C. E. (2025). Comparación del desempeño de LLMs en la generación de diagramas de clases UML mediante un sistema RAG. Revista De Investigación En Tecnologías De La Información, 13(31 Especial), 64–79. https://doi.org/10.36825/RITI.13.31.007