Architecture of a machine translator for Mixtec: a specific approach for indigenous languages with scarce linguistic resources
DOI:
https://doi.org/10.36825/RITI.12.28.007
Keywords:
Indigenous Language Revitalization, Machine Translation, Transformer, Scarce Linguistic Resources, Mixtec
Abstract
Revitalizing indigenous languages is essential to preserving the cultural diversity and ancestral knowledge they represent. Technologies such as artificial intelligence (AI) now offer unprecedented opportunities to support these efforts through machine translation (MT) and neural networks (NN). This work details the architecture of a machine translator for indigenous languages with scarce linguistic resources. The proposal is based on the transformer neural network model, trained on a parallel corpus with automatic alignment techniques. To evaluate translation quality, a functional architecture was implemented as a Spanish-to-Mixtec translation system, with BLEU as the main metric. For this purpose, a corpus of 1.1K grammatically varied words and phrases from the Mixtec language was prepared. The results indicate low translation quality from Spanish to Mixtec. This performance is attributed to the small size of the corpus, and the paper concludes with proposals for improving the quality of machine translation.
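As a rough illustration of the BLEU metric named in the abstract, the following minimal Python sketch computes a sentence-level score from clipped n-gram precisions and a brevity penalty. It is not the authors' evaluation code: the function names `ngrams` and `bleu`, the single-reference setup, the uniform weights, the lack of smoothing, and the placeholder tokens are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU in [0, 1]: one reference, uniform weights, no smoothing.

    Illustrative sketch only; returns 0.0 as soon as any n-gram order
    has no match, which is common for very short sentences.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


# Placeholder tokens stand in for a tokenized system output and its reference.
reference = "a b c d e".split()
candidate = "a b c d".split()
score = bleu(candidate, reference)
```

Scores are often reported multiplied by 100. With a corpus as small as the one described here, many test sentences may share no 4-gram with their reference, so a corpus-level or smoothed variant from an established toolkit would usually be preferable to this unsmoothed sentence-level form.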
License
Copyright (c) 2024 Revista de Investigación en Tecnologías de la Información
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This journal provides open access to its content, based on the principle that offering the public free access to research supports a greater global exchange of knowledge.
Text published in the Revista de Investigación en Tecnologías de la Información (RITI) is distributed under the Creative Commons license (CC BY-NC), which allows third parties to use the published material provided they credit the authors of the work and RITI, but not to use the material for commercial purposes.