Reputation: 85
I'm a newbie working through the Hugging Face Transformers library, trying out the translation models for a data-entry task that involves translating text from English to Italian.
The code I tried, based on the documentation:
from transformers import MarianTokenizer, MarianMTModel
from typing import List
#src = 'en'  # source language
#trg = 'it'  # target language
#model_name = f'Helsinki-NLP/opus-mt-{src}-{trg}'
# Downloaded the model once and saved it locally:
#model = MarianMTModel.from_pretrained(model_name)
#tokenizer = MarianTokenizer.from_pretrained(model_name)
#model.save_pretrained("./model_en_to_it")
#tokenizer.save_pretrained("./tokenizer_en_to_it")
model = MarianMTModel.from_pretrained('./model_en_to_it')
tokenizer = MarianTokenizer.from_pretrained('./tokenizer_en_to_it')
# Next, I iterate over the 'english_text' column of the dataset, translate each
# row from English to Italian, and append the translated text to the list 'italian'.
italian = []
for i in range(len(dataset['english_text'])):
    batch = tokenizer(dataset['english_text'][i],
                      return_tensors="pt", truncation=True,
                      padding=True)
    gen = model.generate(**batch)
    # batch_decode returns a list, so take the single decoded string
    italian.append(tokenizer.batch_decode(gen, skip_special_tokens=True)[0])
Two concerns here:
Even if only one of these problems could be solved, that would be helpful. I would love to learn.
Upvotes: 1
Views: 1041
Reputation: 11213
Virtually all current MT systems are trained on single sentences, not paragraphs. If your input text is in paragraphs, you need to do sentence splitting first; any NLP library will do (e.g., NLTK, spaCy, Stanza). Putting multiple sentences into a single input leads to worse translation quality, because that is not what the model was trained on. Moreover, the complexity of the Transformer model is quadratic with respect to the input length (this does not fully hold when everything is parallelized on a GPU), so it gets very slow with very long inputs.
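For reference, here is a minimal sketch of that approach, reusing the locally saved model and the dataset['english_text'] column from your question. The translate_paragraph helper, the batch size, and the NLTK punkt download are illustrative choices on my part, not anything required by the library:

import nltk
from transformers import MarianMTModel, MarianTokenizer

nltk.download("punkt")  # one-time download of the sentence-splitter data
from nltk.tokenize import sent_tokenize

# Paths taken from the question; adjust to wherever you saved the model.
model = MarianMTModel.from_pretrained('./model_en_to_it')
tokenizer = MarianTokenizer.from_pretrained('./tokenizer_en_to_it')

def translate_paragraph(paragraph, batch_size=8):
    # Split the paragraph into single sentences first.
    sentences = sent_tokenize(paragraph)
    translated = []
    # Translate a few sentences per generate() call; padding lets them share one batch.
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size],
                          return_tensors="pt", truncation=True, padding=True)
        gen = model.generate(**batch)
        translated.extend(tokenizer.batch_decode(gen, skip_special_tokens=True))
    # Re-join the translated sentences into one paragraph.
    return " ".join(translated)

italian = [translate_paragraph(text) for text in dataset['english_text']]

Passing several sentences per generate() call also amortizes the per-call overhead compared to translating one row at a time, as in your loop.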
Upvotes: 3