Django0602

Reputation: 807

Translating using pre-trained Hugging Face transformers not working

I have a situation where I am trying to use the pre-trained Hugging Face models to translate a pandas column of text from Dutch to English. My input is simple:

Dutch_text             
Hallo, het gaat goed
Hallo, ik ben niet in orde
Stackoverflow is nuttig

I am using the code below to translate the above column, and I want to store the result in a new column ENG_Text. So the output will look like this:

ENG_Text             
Hello, I am good
Hi, I'm not okay
Stackoverflow is helpful

The code that I am using is as follows:

#https://huggingface.co/Helsinki-NLP for other pretrained models 
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
input_1 = df['Dutch_text']
input_ids = tokenizer("translate English to Dutch: "+input_1, return_tensors="pt").input_ids # Batch size 1
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)

Any help would be appreciated!

Upvotes: 0

Views: 1514

Answers (1)

Jindřich

Reputation: 11213

This is not how an MT model is supposed to be used. It is not a GPT-like model where you experiment to see whether it can understand instructions. It is a translation model that can only translate, so there is no need to prepend the instruction "translate English to Dutch". (Also, don't you want to translate the other way round, Dutch to English?)

Also, translation models are trained to translate sentence by sentence. If you concatenate all the sentences from the column, the result will be treated as a single sentence. You need to either:

  1. Iterate over the column and translate each sentence independently.

  2. Split the column into batches, so you can parallelize the translation. Note that in that case you need to pad the sentences in each batch to the same length. The easiest way to do this is with the tokenizer's batch_encode_plus method.
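As a rough sketch of option 2 (using the model name from the question, and calling the tokenizer directly on a list, which pads the batch the same way batch_encode_plus does):

```python
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-nl-en")

df = pd.DataFrame({"Dutch_text": ["Hallo, het gaat goed",
                                  "Hallo, ik ben niet in orde",
                                  "Stackoverflow is nuttig"]})

sentences = df["Dutch_text"].tolist()
# Tokenizing the whole list with padding=True pads every sentence to the
# longest one, so the batch can be translated in a single generate() call.
batch = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**batch)
df["ENG_Text"] = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(df["ENG_Text"])
```

For a larger column you would slice `sentences` into chunks of, say, 32 and repeat the tokenize/generate/decode steps per chunk, since padding the whole column to the single longest sentence wastes memory.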

Upvotes: 1

Related Questions