A Comparative Study of MBart and Alternative Transformer Models for Kumauni Language Translation
Neelesh Kumar Tanwar
IIT Kharagpur, India.
Atul Joshi *
Graphic Era Hill University, Bhimtal Campus, India.
Ankur Singh Bist
Graphic Era Hill University, Bhimtal Campus, India.
*Author to whom correspondence should be addressed.
Abstract
The archiving and computational processing of low-resource languages pose significant challenges for natural language processing (NLP). This study investigates state-of-the-art multilingual transformer architectures for machine translation of Kumaoni, an Indo-Aryan language spoken in Northern India for which few digital resources exist. Because Kumaoni is closely related to Hindi, Hindi is used as a proxy language for model training and cross-lingual transfer, which is a central methodological consideration of this work. The performance of MBart (Multilingual Denoising Pre-training for Neural Machine Translation) is compared against two other transformer models, MarianMT and mT5, on a custom parallel dataset of roughly [insert dataset size] sentence pairs. The evaluation metrics are BLEU, ROUGE-L, and TER. MBart achieves the highest BLEU score, with an absolute gain of 2.45 points over MarianMT and almost 4 points over mT5. Despite this advantage in BLEU, its fluency and error rate are expected to improve further through additional experiments with larger datasets and further fine-tuning. These findings indicate that multilingual pre-training and cross-lingual transfer hold promise for low-resource translation, and they provide a replicable framework intended to advance NLP for other poorly resourced languages.
Keywords: Machine translation, MBart, alternative transformer models, Kumauni language translation