Frank

Reputation: 141

How to get translations of one batch of sentences after batch_encode_plus?

I want to get translations of one batch of sentences using pretrained model.

from transformers import AutoModelWithLMHead, AutoTokenizer

model = AutoModelWithLMHead.from_pretrained("Helsinki-NLP/opus-mt-es-en")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
batch_input_str = (("Mary spends $20 on pizza"), ("She likes eating it"), ("The pizza was great"))
encoded = tokenizer.batch_encode_plus(batch_input_str, pad_to_max_length=True)

The encoded output looks like:

{'input_ids': [[4963, 10154, 5021, 9, 25, 1326, 2255, 35, 17462, 0], [552, 3996, 2274, 9, 129, 75, 2223, 25, 1370, 0], [42, 17462, 12378, 9, 25, 5807, 1949, 0, 65000, 65000]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]}

Then, should I just pass the encoded batch to

output = model.generate(encoded)

And then use

res = tokenizer.decode(output)

?

Thanks!

Upvotes: 2

Views: 1417

Answers (1)

cronoik

Reputation: 19520

The model Helsinki-NLP/opus-mt-es-en translates from Spanish to English. Please have a look at the examples below:

# use AutoModelForSeq2SeqLM because AutoModelWithLMHead is deprecated
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-es-en")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
batch_input_str = (("Mary gasta $ 20 en pizza"), ("A ella le gusta comerlo"), ("La pizza estuvo genial"))
encoded = tokenizer.prepare_seq2seq_batch(batch_input_str)
translated = model.generate(**encoded)
tokenizer.batch_decode(translated, skip_special_tokens=True)

Output:

['Mary spends $20 on pizza', 'She likes to eat it.', 'The pizza was great.']
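To address the original question directly: in current versions of transformers you can call the tokenizer itself instead of batch_encode_plus. With padding=True and return_tensors="pt" it returns a padded batch of tensors that can be unpacked straight into model.generate, and batch_decode (rather than decode, which handles a single sequence) turns the generated ids back into strings. A minimal sketch of that equivalent workflow:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-es-en")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")

batch_input_str = ["Mary gasta $ 20 en pizza", "A ella le gusta comerlo", "La pizza estuvo genial"]
# calling the tokenizer directly replaces batch_encode_plus: it pads the
# batch to the longest sentence and returns PyTorch tensors
encoded = tokenizer(batch_input_str, padding=True, return_tensors="pt")
translated = model.generate(**encoded)
# batch_decode handles the whole batch; decode expects a single sequence
result = tokenizer.batch_decode(translated, skip_special_tokens=True)
print(result)
```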

In case you are looking for a model that allows you to translate English to Spanish, you can use Helsinki-NLP/opus-mt-en-ROMANCE. The capital letters indicate that the model supports several languages. You can retrieve a list of the supported languages from the tokenizer:

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ROMANCE")
tokenizer.supported_language_codes

Output:

['>>fr<<',
 '>>es<<',
 '>>it<<',
 '>>pt<<',
 '>>pt_br<<',
 '>>ro<<',
 '>>ca<<',
 '>>gl<<',
 '>>pt_BR<<',
 '>>la<<',
 '>>wa<<',
 '>>fur<<',
 '>>oc<<',
 '>>fr_CA<<',
 '>>sc<<',
 '>>es_ES<<',
 '>>es_MX<<',
 '>>es_AR<<',
 '>>es_PR<<',
 '>>es_UY<<',
 '>>es_CL<<',
 '>>es_CO<<',
 '>>es_CR<<',
 '>>es_GT<<',
 '>>es_HN<<',
 '>>es_NI<<',
 '>>es_PA<<',
 '>>es_PE<<',
 '>>es_VE<<',
 '>>es_DO<<',
 '>>es_EC<<',
 '>>es_SV<<',
 '>>an<<',
 '>>pt_PT<<',
 '>>frp<<',
 '>>lad<<',
 '>>vec<<',
 '>>fr_FR<<',
 '>>co<<',
 '>>it_IT<<',
 '>>lld<<',
 '>>lij<<',
 '>>lmo<<',
 '>>nap<<',
 '>>rm<<',
 '>>scn<<',
 '>>mwl<<']

You can use these language codes to define the target and get the expected translation:

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-ROMANCE")
batch_input_str = (("Mary spends $20 on pizza"), ("She likes eating it"), ("The pizza was great"))
# we define Spanish as the target language
batch_input_str = [ '>>es<< '+ x for x in batch_input_str]
encoded = tokenizer.prepare_seq2seq_batch(batch_input_str)
translated = model.generate(**encoded)
tokenizer.batch_decode(translated, skip_special_tokens=True)

Output:

['Mary gasta $20 en pizza', 'A ella le gusta comerlo.', 'La pizza fue genial.']
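Since the target token is ordinary text prepended to each sentence, the prefixing step needs no model at all, and in principle a single batch can even mix target languages. A small sketch of just that step (the helper name add_target_token is mine, not part of transformers):

```python
def add_target_token(sentences, lang_code):
    """Prepend a Helsinki-NLP target-language token (e.g. '>>es<<') to each sentence."""
    return [f"{lang_code} {s}" for s in sentences]

batch = ["Mary spends $20 on pizza", "She likes eating it"]
prefixed = add_target_token(batch, ">>es<<")
print(prefixed)  # ['>>es<< Mary spends $20 on pizza', '>>es<< She likes eating it']
```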

Upvotes: 3
