Reputation: 807
I have a situation where I am trying to use the pre-trained Hugging Face models to translate a pandas column of text from Dutch to English. My input is simple:
Dutch_text
Hallo, het gaat goed
Hallo, ik ben niet in orde
Stackoverflow is nuttig
I am using the code below to translate the above column, and I want to store the result in a new column ENG_Text. So the output will look like this:
ENG_Text
Hello, I am good
Hi, I'm not okay
Stackoverflow is helpful
The code that I am using is as follows:
#https://huggingface.co/Helsinki-NLP for other pretrained models
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
input_1 = df['Dutch_text']
input_ids = tokenizer("translate English to Dutch: "+input_1, return_tensors="pt").input_ids # Batch size 1
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)
Any help would be appreciated!
Upvotes: 0
Views: 1514
Reputation: 11213
This is not how an MT model is supposed to be used. It is not a GPT-like model where you experiment to see whether it can follow instructions; it is a translation model that can only translate, so there is no need to prepend the instruction "translate English to Dutch". (Besides, don't you want to translate the other way round, from Dutch to English?)
Also, translation models are trained to translate sentence by sentence. If you concatenate all the sentences from the column, they will be treated as a single sentence. You need to either:
Iterate over the column and translate each sentence independently.
Split the column into batches, so you can parallelize the translation. Note that in that case you need to pad the sentences in each batch to the same length. The easiest way to do this is with the tokenizer's batch_encode_plus
method (or, equivalently, by calling the tokenizer on a list of sentences with padding=True).
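Both options can be sketched roughly as follows, using the three example sentences from the question (exact translations may vary by model version, so treat the outputs as illustrative):

```python
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-nl-en")

df = pd.DataFrame({"Dutch_text": [
    "Hallo, het gaat goed",
    "Hallo, ik ben niet in orde",
    "Stackoverflow is nuttig",
]})

# Option 1: iterate over the column, translating one sentence at a time.
def translate(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

df["ENG_Text"] = df["Dutch_text"].apply(translate)

# Option 2: translate the whole column as one padded batch.
# padding=True pads every sentence in the batch to the same length.
batch = tokenizer(df["Dutch_text"].tolist(), padding=True, return_tensors="pt")
outputs = model.generate(**batch)
df["ENG_Text"] = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(df["ENG_Text"])
```

For a large column, option 2 (or option 2 applied to chunks of the column) will be much faster than translating one sentence at a time, since the model processes the padded batch in a single forward pass.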
Upvotes: 1