Raghavi R

Reputation: 85

Translating text from English to Italian using the Hugging Face Helsinki models does not translate the full text

I'm a newbie going through the Hugging Face library, trying out the translation models for a data-entry task that involves translating text from English to Italian.

The code I tried based on the documentation:

from transformers import MarianTokenizer, MarianMTModel
from typing import List

#src = 'en'  # source language
#trg = 'it'  # target language
#model_name = f'Helsinki-NLP/opus-mt-{src}-{trg}'
# Downloaded the model once and saved it locally:
#model = MarianMTModel.from_pretrained(model_name)
#tokenizer = MarianTokenizer.from_pretrained(model_name)
#model.save_pretrained("./model_en_to_it")
#tokenizer.save_pretrained("./tokenizer_en_to_it")


model = MarianMTModel.from_pretrained('./model_en_to_it')
tokenizer = MarianTokenizer.from_pretrained('./tokenizer_en_to_it')

#Next, iterate over each row of the 'english_text' column of the dataset,
#translate the text from English to Italian, and append the translated text
#to the list 'italian'.
 
italian = []
for i in range(len(dataset)):
    batch = tokenizer(dataset['english_text'][i],
                      return_tensors="pt", truncation=True,
                      padding=True)
    gen = model.generate(**batch)
    # batch_decode returns a list; keep the single decoded string
    italian.append(tokenizer.batch_decode(gen, skip_special_tokens=True)[0])

Two concerns here:

  1. It translates and appends only partial text, i.e., it truncates the paragraph if it exceeds a certain length. How can I translate text of any length?
  2. I have about 10k rows and it is taking a very long time.

Even if only one of the problems can be solved, that would be helpful. I would love to learn.

Upvotes: 1

Views: 1041

Answers (1)

Jindřich

Reputation: 11213

Virtually all current MT systems are trained on single sentences, not paragraphs. If your input text is in paragraphs, you need to do sentence splitting first. Any NLP library will do (e.g., NLTK, spaCy, Stanza). Having multiple sentences in a single input leads to worse translation quality (because this is not what the model was trained for). Moreover, the complexity of the Transformer model is quadratic with respect to the input length (this does not fully hold when everything is parallelized on a GPU), so it gets very slow with very long inputs.
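
A minimal sketch of this approach, assuming the locally saved model and tokenizer paths from the question and a dataset object with an 'english_text' column; the NLTK sentence splitter and the batch size of 16 are illustrative choices, not part of the original answer:

import nltk
from nltk.tokenize import sent_tokenize
from transformers import MarianTokenizer, MarianMTModel

nltk.download('punkt')  # NLTK's pretrained sentence-splitting model

model = MarianMTModel.from_pretrained('./model_en_to_it')
tokenizer = MarianTokenizer.from_pretrained('./tokenizer_en_to_it')

def translate_paragraph(paragraph, batch_size=16):
    # Split the paragraph into sentences so each input matches what the
    # MT model was trained on, then translate the sentences in batches
    # instead of feeding one long input that gets truncated.
    sentences = sent_tokenize(paragraph)
    translated = []
    for start in range(0, len(sentences), batch_size):
        chunk = sentences[start:start + batch_size]
        batch = tokenizer(chunk, return_tensors="pt",
                          padding=True, truncation=True)
        gen = model.generate(**batch)
        translated.extend(tokenizer.batch_decode(gen, skip_special_tokens=True))
    return " ".join(translated)

italian = [translate_paragraph(text) for text in dataset['english_text']]

Translating several sentences per generate call (and, if a GPU is available, moving the model there with model.to("cuda") and sending each batch to the same device) is what cuts the running time; splitting paragraphs into sentences is what keeps each input within the length the model expects.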

Upvotes: 3
