Fine-Tuning T5-Small Model for a Completely New Language: Limbu

**MyrinNew** · 12-20-2024, 08:02 PM

Introduction

Natural Language Processing (NLP) is expanding its reach into underserved languages. In this blog, we’ll explore how to fine-tune the T5-Small model to translate between English and Limbu, a Tibeto-Burman language spoken in Nepal and neighboring regions.

Preparing the Data

We created an English-Limbu translation dataset in JSON format, containing over 1,500 pairs. Below is a sample of the data:

[
  {
    "id": 1,
    "translation": {
      "en": "hi",
      "lim": "ᤜᤠᤤ ॥"
    }
  },
  {
    "id": 2,
    "translation": {
      "en": "Let's eat.",
      "lim": "ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥ "
    }
  },
  {
    "id": 3,
    "translation": {
      "en": "We saw it.",
      "lim": "ᤀᤏᤡᤃᤧ ᤁᤴ ᤏᤡᤔᤠᤏᤠ ॥ "
    }
  },
  ...
]

The dataset was saved as limbu-english.json.
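Before training, it can help to check that every record really has both sides of the pair. The following is a minimal sketch, not part of the original pipeline: the helper names `load_pairs` and `validate_pairs` are hypothetical, and the third sample record has been emptied on purpose to show what a rejected record looks like.

```python
import json

def load_pairs(path):
    """Read the JSON file of translation records (same shape as the sample above)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def validate_pairs(records):
    """Return (clean, problems): whitespace-stripped pairs and ids of unusable records."""
    clean, problems = [], []
    for record in records:
        pair = record.get("translation", {})
        en = pair.get("en", "").strip()
        lim = pair.get("lim", "").strip()
        if en and lim:
            clean.append({"id": record.get("id"), "translation": {"en": en, "lim": lim}})
        else:
            problems.append(record.get("id"))
    return clean, problems

# Illustrative records only; record 3 has its Limbu side removed deliberately.
sample = [
    {"id": 1, "translation": {"en": "hi", "lim": "ᤜᤠᤤ ॥"}},
    {"id": 2, "translation": {"en": "Let's eat.", "lim": "ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥ "}},
    {"id": 3, "translation": {"en": "We saw it.", "lim": ""}},
]

clean, problems = validate_pairs(sample)
print(len(clean), problems)
```

On the real file, the same check would be `validate_pairs(load_pairs('limbu-english.json'))`. Stripping also removes the trailing spaces visible in some of the Limbu strings.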

Setting Up the Environment

Install the required libraries in Google Colab:

!pip install "transformers[sentencepiece]" datasets evaluate sacrebleu

Load the dataset:

from datasets import load_dataset

path = 'limbu-english.json'
translations = load_dataset('json', data_files=path)
translations = translations["train"].train_test_split(test_size=0.2)

Loading the Pretrained Model

We initialized the T5-Small model:

from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

Tokenizing the Dataset

We trained a new tokenizer on the Limbu sentences and used it to tokenize the dataset:

def get_training_corpus():
    # Yield the Limbu sentences in batches of 1,000
    dataset = translations["train"]
    for start_idx in range(0, len(dataset), 1000):
        yield [item['lim'] for item in dataset[start_idx:start_idx + 1000]['translation']]

# 52000 is the requested vocabulary size
lim_tokenizer = tokenizer.train_new_from_iterator(get_training_corpus(), 52000)

source_lang = "en"
target_lang = "lim"
prefix = "translate English to Limbu: "

def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    return lim_tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

tokenized_translations = translations.map(preprocess_function, batched=True)
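A quick way to see what the new tokenizer has to cope with is to profile the script of the target sentences. The sketch below is an illustration we added, assuming only that Limbu letters live in the Unicode Limbu block (U+1900 to U+194F); the function name `script_profile` is hypothetical.

```python
from collections import Counter

LIMBU_BLOCK = range(0x1900, 0x1950)  # Unicode "Limbu" block, U+1900..U+194F

def script_profile(sentences):
    """Count Limbu-block characters versus all other non-space characters."""
    counts = Counter()
    for sentence in sentences:
        for ch in sentence:
            if ch.isspace():
                continue
            counts["limbu" if ord(ch) in LIMBU_BLOCK else "other"] += 1
    return counts

# Two target sentences from the sample data
targets = ["ᤜᤠᤤ ॥", "ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥ "]
profile = script_profile(targets)
print(profile)
```

In the sample, the "other" bucket comes from the sentence-final "॥", which is the Devanagari double danda (U+0965) rather than a Limbu-block character.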

Preparing for Training

The tokenized data was prepared for the TensorFlow model:

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=lim_tokenizer, model=checkpoint, return_tensors="tf")

tf_train_set = model.prepare_tf_dataset(
    tokenized_translations["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_translations["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
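To make the collator's job concrete, here is a simplified pure-Python sketch of dynamic padding. It is an illustration rather than the library's implementation: `pad_batch` is a hypothetical helper and the token ids are made up. It pads inputs with the pad token and pads labels with -100, the default `label_pad_token_id` of `DataCollatorForSeq2Seq`, so that padded label positions are ignored by the loss.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def pad_batch(input_ids, labels, pad_token_id):
    """Pad inputs with pad_token_id and labels with IGNORE_INDEX to the batch maximum."""
    max_in = max(len(seq) for seq in input_ids)
    max_lab = max(len(seq) for seq in labels)
    return {
        "input_ids": [seq + [pad_token_id] * (max_in - len(seq)) for seq in input_ids],
        "attention_mask": [[1] * len(seq) + [0] * (max_in - len(seq)) for seq in input_ids],
        "labels": [seq + [IGNORE_INDEX] * (max_lab - len(seq)) for seq in labels],
    }

batch = pad_batch(
    input_ids=[[5, 6, 7, 1], [8, 1]],  # made-up token ids
    labels=[[9, 1], [10, 11, 12, 1]],
    pad_token_id=0,
)
print(batch["labels"])  # [[9, 1, -100, -100], [10, 11, 12, 1]]
```

This is also why the metric function has to swap -100 back to the pad token id before decoding the labels.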

Training the Model

We used AdamWeightDecay for optimization:

from transformers import AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)

Next, we define the metrics to monitor during training:

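import numpy as np
import evaluate
from transformers.keras_callbacks import KerasMetricCallback

# Assumed definitions: the original snippet uses `metric` and `postprocess_text`
# without defining them. SacreBLEU matches the "score" key read below.
metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = lim_tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace the -100 label padding so the labels can be decoded
    labels = np.where(labels != -100, labels, lim_tokenizer.pad_token_id)
    decoded_labels = lim_tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    # Average length of the generated sequences, ignoring padding
    prediction_lens = [np.count_nonzero(pred != lim_tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)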

These metrics also appear in the training log, but we will store them on the Hugging Face Hub instead. First, log in:

from huggingface_hub import notebook_login

notebook_login()

Then push the model and its logs to the Hub during training:

from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(output_dir="eng-limbu-t5-001", tokenizer=lim_tokenizer)

# Include the metric callback defined earlier so BLEU is reported each epoch
callbacks = [
    metric_callback,
    push_to_hub_callback,
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10),
]
history = model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=500, callbacks=callbacks)
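With epochs=500 as an upper limit, it is the early-stopping callback that decides when training actually ends. The sketch below is a simplified model of the patience rule, not the Keras source: `early_stopping_epoch` is a hypothetical helper, the loss values are made up, and it ignores options such as `min_delta`.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [1.0, 0.8, 0.9, 0.85, 0.81]  # made-up validation losses
print(early_stopping_epoch(losses, patience=3))   # 5
print(early_stopping_epoch(losses, patience=10))  # None
```

With patience=10, training continues until the validation loss has failed to improve on its best value for ten consecutive epochs.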

Visualizing Training Progress

We plotted the training and validation loss:

import matplotlib.pyplot as plt

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

Testing the Model

We tested the model with the pipeline API, loading the checkpoint pushed to the Hub:

from transformers import pipeline

translator = pipeline("text2text-generation", model="bedus-creation/eng-limbu-t5-001")
result = translator("translate English to Limbu: Hello")
print(result)

Evaluating with BLEU Score

Finally, we calculated the BLEU score for translation accuracy:

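import evaluate

bleu = evaluate.load("bleu")

# Predictions should be model outputs in Limbu; here we reuse the
# pipeline result from the previous step.
predictions = [
    result[0]["generated_text"],
]
references = [
    ["ᤜᤠᤤ ॥"],
]

results = bleu.compute(predictions=predictions, references=references)

print(results)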
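To read the BLEU output, it helps to know what its precision terms measure. The sketch below computes clipped n-gram precision for one sentence pair. It is a simplified illustration, not the `evaluate` implementation: it uses whitespace tokenization, skips the brevity penalty, and uses English example strings for readability. The function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(prediction, reference, n):
    """Clipped n-gram precision of a whitespace-tokenized prediction."""
    pred_counts = ngrams(prediction.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0  # the prediction is too short to contain any n-gram of this order
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total

print(modified_precision("the cat sat", "the cat sat down", 1))  # 1.0
print(modified_precision("the cat sat", "the cat sat down", 4))  # 0.0
```

BLEU combines the precisions for orders 1 through 4, so a sentence shorter than four tokens has no 4-grams and tends to score 0 without smoothing. Scoring the whole test set at once gives a more informative number than scoring a single short sentence.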

Conclusion

Fine-tuning the T5-Small model for Limbu demonstrates the potential of NLP in preserving and advancing underrepresented languages. With more training data and optimization, such models can become invaluable tools for language preservation and cross-cultural communication.
