Architecture of a machine translator for mixteco: a specific approach for indigenous languages with scarce linguistic resources

Authors

R. Bautista Morales, Y. Martínez Ramírez, L. E. Rocha Peña, R. E. Montes Santiago

DOI:

https://doi.org/10.36825/RITI.12.28.007

Keywords:

Indigenous Language Revitalization, Machine Translation, Transformer, Scarce Linguistic Resources, Mixtec

Abstract

Revitalizing indigenous languages is essential to preserving the cultural diversity and ancestral knowledge they represent. Technologies such as artificial intelligence (AI) now offer new opportunities to support these efforts through machine translation (MT) and neural networks (NN). This work details the architecture of a machine translator for indigenous languages with scarce linguistic resources. The proposal is built on the transformer neural network model, which uses a parallel corpus and automatic alignment techniques. To evaluate translation quality, the architecture is implemented as a working Spanish-to-Mixtec translation system, with BLEU as the main metric. For this purpose, a grammatically rich corpus of 1.1K Mixtec words and phrases was prepared. The results indicate low translation quality from Spanish to Mixtec. This limited performance is attributed to the small size of the corpus, and the paper concludes with proposals for improving machine translation quality.
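Since BLEU is named as the main metric, the following minimal sketch illustrates how a sentence-level BLEU score can be computed: clipped n-gram precision combined with a brevity penalty. It is not the authors' evaluation code. The function names `ngrams` and `bleu`, the whitespace tokenization, the single-reference setup, and the placeholder token strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference, without smoothing.

    Assumes whitespace tokenization (an illustrative simplification).
    Returns 0.0 as soon as any n-gram order has no match.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Placeholder tokens, not real Spanish or Mixtec data.
    print(bleu("a b c d e", "a b c d f", max_n=2))
```

With a corpus as small as the one described, many sentences share no higher-order n-grams with their reference, so an unsmoothed score like this one drops to zero often. Corpus-level aggregation or a smoothed variant is usually preferred in practice.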

Published

2024-11-25

How to Cite

Bautista Morales, R., Martínez Ramírez, Y., Rocha Peña, L. E., & Montes Santiago, R. E. (2024). Architecture of a machine translator for mixteco: a specific approach for indigenous languages with scarce linguistic resources. Revista De Investigación En Tecnologías De La Información, 12(28 (Especial)), 71–81. https://doi.org/10.36825/RITI.12.28.007