Reputation: 85
I'm a newbie working through the Hugging Face Transformers library, trying out the translation models for a data-entry task that involves translating text from English to Italian.
The code I tried, based on the documentation:
from transformers import MarianTokenizer, MarianMTModel
from typing import List
#src = 'en'  # source language
#trg = 'it'  # target language
#model_name = f'Helsinki-NLP/opus-mt-{src}-{trg}'
# Downloaded the model once and saved it locally:
#model = MarianMTModel.from_pretrained(model_name)
#tokenizer = MarianTokenizer.from_pretrained(model_name)
#model.save_pretrained("./model_en_to_it")
#tokenizer.save_pretrained("./tokenizer_en_to_it")
model = MarianMTModel.from_pretrained('./model_en_to_it')
tokenizer = MarianTokenizer.from_pretrained('./tokenizer_en_to_it')
# Next, I iterate over the 'english_text' column of the dataset, translate each
# row from English to Italian, and append the translated text to the list 'italian'.
italian = []
for i in range(len(dataset['english_text'])):
    batch = tokenizer(dataset['english_text'][i],
                      return_tensors="pt", truncation=True,
                      padding=True)
    gen = model.generate(**batch)
    # batch_decode returns a list, so take the single decoded string
    italian.append(tokenizer.batch_decode(gen, skip_special_tokens=True)[0])
Two concerns here:
Even if only one of these problems could be solved, that would be helpful. I would love to learn.
Upvotes: 1
Views: 1041
Reputation: 11213
Virtually all current MT systems are trained on single sentences, not paragraphs. If your input text is in paragraphs, you need to do sentence splitting first; any NLP library will do (e.g., NLTK, spaCy, Stanza). Putting multiple sentences into a single input leads to worse translation quality, because that is not what the model was trained on. Moreover, the complexity of the Transformer model is quadratic with respect to the input length (this does not fully hold when everything is parallelized on a GPU), so it gets very slow with very long inputs.
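For reference, here is a minimal sketch of that approach, reusing the locally saved model and the dataset['english_text'] column from your question. The translate_paragraph helper, the batch size, and the NLTK punkt download are illustrative choices on my part, not anything required by the library:

import nltk
from transformers import MarianMTModel, MarianTokenizer

nltk.download("punkt")  # one-time download of the sentence-splitter data
from nltk.tokenize import sent_tokenize

# Paths taken from the question; adjust to wherever you saved the model.
model = MarianMTModel.from_pretrained('./model_en_to_it')
tokenizer = MarianTokenizer.from_pretrained('./tokenizer_en_to_it')

def translate_paragraph(paragraph, batch_size=8):
    # Split the paragraph into single sentences first.
    sentences = sent_tokenize(paragraph)
    translated = []
    # Translate a few sentences per generate() call; padding lets them share one batch.
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size],
                          return_tensors="pt", truncation=True, padding=True)
        gen = model.generate(**batch)
        translated.extend(tokenizer.batch_decode(gen, skip_special_tokens=True))
    # Re-join the translated sentences into one paragraph.
    return " ".join(translated)

italian = [translate_paragraph(text) for text in dataset['english_text']]

Passing several sentences per generate() call also amortizes the per-call overhead compared to translating one row at a time, as in your loop.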
Upvotes: 3