Fine-Tuning T5-Small Model for a Completely New Language: Limbu

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1


    Introduction

    Natural Language Processing (NLP) is expanding its reach into underserved languages. In this blog, we’ll explore how to fine-tune the T5-Small model to translate between English and Limbu, a Tibeto-Burman language spoken in Nepal and neighboring regions.





    Preparing the Data

    We created an English-Limbu translation dataset in JSON format, containing over 1,500 pairs. Below is a sample of the data:






    [
      {
        "id": 1,
        "translation": {
          "en": "hi",
          "lim": "ᤜᤠᤤ ॥"
        }
      },
      {
        "id": 2,
        "translation": {
          "en": "Let's eat.",
          "lim": "ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥ "
        }
      },
      {
        "id": 3,
        "translation": {
          "en": "We saw it.",
          "lim": "ᤀᤏᤡᤃᤧ ᤁᤴ ᤏᤡᤔᤠᤏᤠ ॥ "
        }
      },
      ...
    ]







    The dataset was saved as limbu-english.json.
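
Before training, it can help to confirm that every record actually contains both sides of the pair. The following is a minimal sketch, assuming the record layout shown above; the helper name `validate_pairs` and the deliberately broken second record are hypothetical and only serve as an illustration.

```python
import json


def validate_pairs(records, source_lang="en", target_lang="lim"):
    """Return the ids of records whose source or target text is missing or blank."""
    bad_ids = []
    for record in records:
        pair = record.get("translation", {})
        src = pair.get(source_lang, "")
        tgt = pair.get(target_lang, "")
        if not src.strip() or not tgt.strip():
            bad_ids.append(record.get("id"))
    return bad_ids


# Hypothetical sample: record 2 has an empty Limbu side on purpose.
sample = [
    {"id": 1, "translation": {"en": "hi", "lim": "ᤜᤠᤤ ॥"}},
    {"id": 2, "translation": {"en": "Let's eat.", "lim": ""}},
]

print(validate_pairs(sample))  # [2]

# ensure_ascii=False keeps the Limbu script readable instead of \u escapes.
print(json.dumps(sample[0], ensure_ascii=False))
```

For the real file, the records could be read with `json.load` from limbu-english.json and passed to the same function.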





    Setting Up the Environment

    Install the required libraries in Google Colab:






    # The TF* classes used below require a transformers installation with TensorFlow support.
    !pip install "transformers[sentencepiece]" datasets evaluate sacrebleu







    Load the dataset:






    from datasets import load_dataset

    path = 'limbu-english.json'
    translations = load_dataset('json', data_files=path)
    translations = translations["train"].train_test_split(test_size=0.2)
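
To make the 80/20 split concrete, here is a plain-Python sketch of the idea. It is not the `datasets` implementation: the function name `split_pairs`, the rounding rule, and the seed value are assumptions made for illustration. With `datasets` itself, passing a `seed` argument to `train_test_split` keeps the split reproducible between runs.

```python
import math
import random


def split_pairs(records, test_size=0.2, seed=42):
    """Shuffle a copy of the records and cut off a test portion."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Rounding up is a choice made for this sketch.
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# 1,500 stands in for the dataset size; integers stand in for sentence pairs.
train, test = split_pairs(list(range(1500)))
print(len(train), len(test))  # 1200 300
```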










    Loading the Pretrained Model

    We initialized the T5-Small model:






    from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

    checkpoint = "t5-small"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)










    Tokenizing the Dataset

    We trained a custom tokenizer on the training split, then tokenized the dataset:






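    def get_training_corpus(batch_size=1000):
        """Yield batches of sentences for tokenizer training."""
        dataset = translations["train"]
        for start_idx in range(0, len(dataset), batch_size):
            batch = dataset[start_idx:start_idx + batch_size]["translation"]
            # Include the English side as well: the same tokenizer encodes the
            # English inputs, and a Limbu-only vocabulary would map most of
            # them to the unknown token.
            yield [item[lang] for item in batch for lang in ("en", "lim")]

    # Reuse the T5 tokenizer's algorithm and special tokens with a new vocabulary
    # (52000 is the requested vocabulary size).
    lim_tokenizer = tokenizer.train_new_from_iterator(get_training_corpus(), 52000)

    # The vocabulary changed, so resize the model's embedding matrix to match.
    model.resize_token_embeddings(len(lim_tokenizer))

    source_lang = "en"
    target_lang = "lim"
    prefix = "translate English to Limbu: "

    def preprocess_function(examples):
        """Prefix the English inputs and tokenize inputs and targets together."""
        inputs = [prefix + example[source_lang] for example in examples["translation"]]
        targets = [example[target_lang] for example in examples["translation"]]
        return lim_tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

    tokenized_translations = translations.map(preprocess_function, batched=True)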
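
A quick sanity check after training a tokenizer is to measure how often it falls back to the unknown token on each language. The sketch below is one way to do that; the helper `unknown_token_rate` and the toy encoder are hypothetical, and the function takes the encoder as a parameter so it runs without any model download.

```python
def unknown_token_rate(texts, encode, unk_token_id):
    """Return the fraction of produced token IDs that equal unk_token_id."""
    total = 0
    unknown = 0
    for text in texts:
        ids = encode(text)
        total += len(ids)
        unknown += sum(1 for token_id in ids if token_id == unk_token_id)
    return unknown / total if total else 0.0


# Toy whitespace encoder standing in for a real tokenizer (ID 2 = unknown).
toy_vocab = {"hi": 5, "there": 6}


def toy_encode(text):
    return [toy_vocab.get(word, 2) for word in text.split()]


print(unknown_token_rate(["hi there", "hi friend"], toy_encode, unk_token_id=2))  # 0.25
```

With the tokenizer from this post, the call might look like `unknown_token_rate(sentences, lambda t: lim_tokenizer(t)["input_ids"], lim_tokenizer.unk_token_id)`, run once on English sentences and once on Limbu sentences. A high rate on either side suggests the vocabulary does not cover that language.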










    Preparing for Training

    The tokenized data was prepared for the TensorFlow model:






    from transformers import DataCollatorForSeq2Seq

    # Pass the model object (not the checkpoint name) and request TensorFlow tensors.
    data_collator = DataCollatorForSeq2Seq(tokenizer=lim_tokenizer, model=model, return_tensors="tf")

    tf_train_set = model.prepare_tf_dataset(
        tokenized_translations["train"],
        shuffle=True,
        batch_size=16,
        collate_fn=data_collator,
    )

    tf_test_set = model.prepare_tf_dataset(
        tokenized_translations["test"],
        shuffle=False,
        batch_size=16,
        collate_fn=data_collator,
    )
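
To show what the collator contributes, here is a simplified plain-Python sketch of dynamic padding. It is not the `DataCollatorForSeq2Seq` implementation: the function name `pad_batch` is hypothetical, and the pad ID of 0 is an assumption. The label padding value of -100 matches the value that the metric code later replaces before decoding.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad inputs and labels to the longest sequence in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_input = len(f["input_ids"])
        n_label = len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_input - n_input))
        # The mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * n_input + [0] * (max_input - n_input))
        # Label positions filled with -100 are ignored by the loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_label - n_label))
    return batch


# Made-up token IDs for two examples of different lengths.
features = [
    {"input_ids": [13, 7, 1], "labels": [21, 1]},
    {"input_ids": [13, 1], "labels": [21, 22, 23, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[13, 7, 1], [13, 1, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
print(batch["labels"])          # [[21, 1, -100, -100], [21, 22, 23, 1]]
```

Padding per batch rather than to a global maximum keeps batches of short sentences small, which matters for a dataset made mostly of short phrases.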










    Training the Model

    We used AdamWeightDecay for optimization:






    from transformers import AdamWeightDecay
    import tensorflow as tf

    optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
    model.compile(optimizer=optimizer)








    Next, we define the metrics to monitor during training:






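    import evaluate
    import numpy as np
    from transformers.keras_callbacks import KerasMetricCallback

    # sacreBLEU reports its result under the "score" key used below.
    metric = evaluate.load("sacrebleu")

    def postprocess_text(preds, labels):
        """Strip whitespace; sacreBLEU expects a list of references per prediction."""
        preds = [pred.strip() for pred in preds]
        labels = [[label.strip()] for label in labels]
        return preds, labels

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        if isinstance(preds, tuple):
            preds = preds[0]
        decoded_preds = lim_tokenizer.batch_decode(preds, skip_special_tokens=True)

        # -100 marks padded label positions; swap in the pad token before decoding.
        labels = np.where(labels != -100, labels, lim_tokenizer.pad_token_id)
        decoded_labels = lim_tokenizer.batch_decode(labels, skip_special_tokens=True)

        decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

        result = metric.compute(predictions=decoded_preds, references=decoded_labels)
        result = {"bleu": result["score"]}

        prediction_lens = [np.count_nonzero(pred != lim_tokenizer.pad_token_id) for pred in preds]
        result["gen_len"] = np.mean(prediction_lens)
        result = {k: round(v, 4) for k, v in result.items()}
        return result

    # predict_with_generate=True passes generated token IDs (not logits) to compute_metrics.
    # Add metric_callback to the callbacks list given to model.fit() to report BLEU each epoch.
    metric_callback = KerasMetricCallback(
        metric_fn=compute_metrics, eval_dataset=tf_test_set, predict_with_generate=True
    )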







    These metrics also appear in the training logs, but we will store them on the Hugging Face Hub instead. First, log in from the notebook:






    from huggingface_hub import notebook_login

    notebook_login()







    Then push the model and training logs to the Hugging Face Hub with a callback:






    from transformers.keras_callbacks import PushToHubCallback
    push_to_hub_callback = PushToHubCallback(output_dir="eng-limbu-t5-001", tokenizer=lim_tokenizer)

    # Stop once val_loss has not improved for 10 consecutive epochs.
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
    callbacks = [push_to_hub_callback, early_stopping]
    history = model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=500, callbacks=callbacks)
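
With `epochs=500`, the `patience=10` setting is what decides how long training really runs. The sketch below mimics the patience rule in plain Python so the stopping point is easy to reason about. It is a simplified illustration, not the Keras implementation, and the function name and loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 1.
losses = [1.00, 0.90, 0.95, 0.96, 0.97]
print(early_stopping_epoch(losses, patience=2))  # 3
```

Note that training stops `patience` epochs after the best epoch, so the final weights are not the best ones unless they are restored or checkpointed separately.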










    Visualizing Training Progress

    We plotted the training and validation loss per epoch:






    import matplotlib.pyplot as plt

    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Test'], loc='upper left')
    plt.show()










    Testing the Model

    We tested the model using the pipeline module:






    from transformers import pipeline

    translator = pipeline("text2text-generation", model="bedus-creation/eng-limbu-t5-001")
    result = translator("translate English to Limbu: Hello")
    print(result)
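
The pipeline returns a list with one dictionary per input, and the translated string sits under the `generated_text` key. A small wrapper can hide both the task prefix and that unpacking. The sketch below is an assumption about how one might structure it: the name `translate` is hypothetical, and a stub stands in for the real pipeline so the example runs without downloading the model.

```python
def translate(text, translator, prefix="translate English to Limbu: "):
    """Add the task prefix, run the translator, and return only the output string."""
    outputs = translator(prefix + text)
    return outputs[0]["generated_text"]


# Stub with the same output shape as a text2text-generation pipeline.
def fake_translator(prompt):
    return [{"generated_text": prompt.upper()}]


print(translate("Hello", fake_translator))  # TRANSLATE ENGLISH TO LIMBU: HELLO
```

Passing the `translator` object created above in place of `fake_translator` would return the model's Limbu output as a plain string.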










    Evaluating with BLEU Score

    Finally, we calculated the BLEU score for translation accuracy:






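    import evaluate

    bleu = evaluate.load("bleu")

    # Score the model's Limbu output, not the English source sentence.
    predictions = [
        translator("translate English to Limbu: hi")[0]["generated_text"],
    ]
    references = [
        ["ᤜᤠᤤ ॥"],
    ]

    results = bleu.compute(predictions=predictions, references=references)

    print(results)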
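
One caveat when reading the result: standard BLEU combines 1-gram to 4-gram precision, so a single sentence with fewer than four tokens scores 0 even when it matches the reference exactly. The sketch below illustrates this with a simplified corpus-level BLEU. It assumes whitespace tokenization, one reference per prediction, and no smoothing, so it is not the `evaluate` implementation and its scores may differ from it.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(predictions, references, max_order=4):
    """Simplified corpus BLEU between 0 and 1 (single reference, no smoothing)."""
    matches = [0] * max_order
    totals = [0] * max_order
    pred_len = 0
    ref_len = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.split(), ref.split()
        pred_len += len(p)
        ref_len += len(r)
        for n in range(1, max_order + 1):
            p_counts, r_counts = ngrams(p, n), ngrams(r, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((p_counts & r_counts).values())
            totals[n - 1] += sum(p_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_order
    brevity = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity * math.exp(log_precision)


# A perfect two-token match still scores 0 because it has no 3-grams or 4-grams.
print(corpus_bleu(["ᤜᤠᤤ ॥"], ["ᤜᤠᤤ ॥"]))  # 0.0
# A perfect match with at least four tokens scores 1.
print(corpus_bleu(["a b c d e"], ["a b c d e"]))  # 1.0
```

For this reason, a BLEU score is more informative when computed over the whole test split rather than over one short greeting.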










    Conclusion

    Fine-tuning the T5-Small model for Limbu demonstrates the potential of NLP in preserving and advancing underrepresented languages. With more training data and optimization, such models can become invaluable tools for language preservation and cross-cultural communication.



