Zero-Shot Learning For Multilingual Document Classification In Low-Resource Languages

Nasios Orinos; Quedevo Onola; Ong Ben Chistoff

doi:10.51903/jtie.v4i3.446

Authors

Nasios Orinos, Cyprus College, Nicosia, Cyprus
Quedevo Onola, Cyprus College, Nicosia, Cyprus
Ong Ben Chistoff, UNITAR University College, Kuala Lumpur, Malaysia

Keywords:

Zero-shot learning (ZSL), Multilingual document classification, Low-resource languages, XLM-RoBERTa (XLM-R), Multilingual T5 (mT5)

Abstract

Document classification in low-resource languages remains a critical challenge because annotated datasets, language-specific resources, and linguistic tools are scarce. This study investigates the effectiveness of zero-shot learning (ZSL) for multilingual document classification, focusing on three low-resource Southeast Asian languages: Javanese, Sundanese, and Malay. We adopt a zero-shot cross-lingual transfer approach: labeled English data serves as the source domain, and the models are evaluated on unseen target-language documents without any supervised fine-tuning on those languages. We employ two state-of-the-art multilingual transformer models, XLM-RoBERTa (XLM-R) and Multilingual T5 (mT5), and assess their ability to generalize across linguistically distant languages. Experimental results show that XLM-R achieves higher average accuracy (≈78%) and F1 score (≈0.76) than mT5 (≈74% accuracy, 0.72 F1), demonstrating stronger transferability and stability. Both models exhibit efficient inference speed and manageable computational costs, indicating potential for deployment in resource-constrained environments. The findings provide an early benchmark for zero-shot multilingual document classification in Southeast Asian languages and highlight the feasibility of inclusive NLP systems that bridge the data gap for underrepresented linguistic communities.
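
The evaluation protocol described in the abstract can be illustrated with a minimal sketch. It assumes a classifier that has only seen labeled English data and is scored per target language on accuracy and F1. The abstract does not say how F1 is averaged, so macro-averaging is an assumption here, and the names `evaluate_zero_shot`, `predict`, and `test_sets` are hypothetical, not taken from the study.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of documents whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(gold) | set(pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


def evaluate_zero_shot(
    predict: Callable[[str], str],
    test_sets: Dict[str, List[Tuple[str, str]]],
) -> Dict[str, Dict[str, float]]:
    """Score a source-language-trained classifier on each target language.

    `predict` stands in for a model fine-tuned on English labels only;
    no target-language examples are used for training at any point.
    """
    results: Dict[str, Dict[str, float]] = {}
    for lang, examples in test_sets.items():
        gold = [label for _, label in examples]
        pred = [predict(text) for text, _ in examples]
        results[lang] = {
            "accuracy": accuracy(gold, pred),
            "macro_f1": macro_f1(gold, pred),
        }
    return results
```

In practice `predict` would wrap a fine-tuned XLM-R or mT5 model, and `test_sets` would map each target language (for example Javanese, Sundanese, Malay) to its labeled test documents. Averaging the per-language scores would then give figures comparable in kind to those reported above.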
