Reputation: 141
I want to translate a batch of sentences using a pretrained model.
from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("Helsinki-NLP/opus-mt-es-en")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
batch_input_str = ("Mary spends $20 on pizza", "She likes eating it", "The pizza was great")
encoded = tokenizer.batch_encode_plus(batch_input_str, pad_to_max_length=True)
The resulting encoded looks like this:
{'input_ids': [[4963, 10154, 5021, 9, 25, 1326, 2255, 35, 17462, 0], [552, 3996, 2274, 9, 129, 75, 2223, 25, 1370, 0], [42, 17462, 12378, 9, 25, 5807, 1949, 0, 65000, 65000]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]}
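(For reference, the attention_mask just flags real tokens with 1 and padding with 0; assuming 65000 is this tokenizer's pad token id, the mask can be reconstructed directly from the ids:)

```python
# Sketch: derive the attention mask from the padded input ids.
# 65000 appears to be the pad token id here -- that is an assumption.
PAD_ID = 65000
input_ids = [
    [4963, 10154, 5021, 9, 25, 1326, 2255, 35, 17462, 0],
    [42, 17462, 12378, 9, 25, 5807, 1949, 0, 65000, 65000],
]
# 1 for a real token, 0 for a padded position
attention_mask = [[0 if tok == PAD_ID else 1 for tok in seq] for seq in input_ids]
```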
Then, should I just pass encoded to
output = model.generate(**encoded)
and then use
res = tokenizer.decode(output)
?
Thanks!
Upvotes: 2
Views: 1417
Reputation: 19520
The model Helsinki-NLP/opus-mt-es-en translates from Spanish to English. Please have a look at the examples below:
# use AutoModelForSeq2SeqLM because AutoModelWithLMHead is deprecated
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-es-en")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
batch_input_str = ["Mary gasta $ 20 en pizza", "A ella le gusta comerlo", "La pizza estuvo genial"]
# prepare_seq2seq_batch is deprecated in recent transformers releases;
# calling the tokenizer directly with padding does the same job
encoded = tokenizer(batch_input_str, return_tensors="pt", padding=True)
translated = model.generate(**encoded)
tokenizer.batch_decode(translated, skip_special_tokens=True)
Output:
['Mary spends $20 on pizza', 'She likes to eat it.', 'The pizza was great.']
If you are looking for a model that translates English to Spanish, you can use Helsinki-NLP/opus-mt-en-ROMANCE. The capital letters indicate that the model supports several target languages. You can retrieve a list of the supported languages from the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ROMANCE")
tokenizer.supported_language_codes
Output:
['>>fr<<',
'>>es<<',
'>>it<<',
'>>pt<<',
'>>pt_br<<',
'>>ro<<',
'>>ca<<',
'>>gl<<',
'>>pt_BR<<',
'>>la<<',
'>>wa<<',
'>>fur<<',
'>>oc<<',
'>>fr_CA<<',
'>>sc<<',
'>>es_ES<<',
'>>es_MX<<',
'>>es_AR<<',
'>>es_PR<<',
'>>es_UY<<',
'>>es_CL<<',
'>>es_CO<<',
'>>es_CR<<',
'>>es_GT<<',
'>>es_HN<<',
'>>es_NI<<',
'>>es_PA<<',
'>>es_PE<<',
'>>es_VE<<',
'>>es_DO<<',
'>>es_EC<<',
'>>es_SV<<',
'>>an<<',
'>>pt_PT<<',
'>>frp<<',
'>>lad<<',
'>>vec<<',
'>>fr_FR<<',
'>>co<<',
'>>it_IT<<',
'>>lld<<',
'>>lij<<',
'>>lmo<<',
'>>nap<<',
'>>rm<<',
'>>scn<<',
'>>mwl<<']
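Since the list is long, it can help to validate a code before prepending it. A minimal sketch (add_target_code is a hypothetical helper, not part of transformers):

```python
# Hypothetical guard: check a Marian target-language code against the
# supported_language_codes list before prefixing it to each sentence.
def add_target_code(sentences, code, supported_codes):
    if code not in supported_codes:
        raise ValueError(f"{code} is not supported by this model")
    return [f"{code} {s}" for s in sentences]
```

For example, add_target_code(["The pizza was great"], ">>es<<", tokenizer.supported_language_codes) would return the sentence with the Spanish prefix attached.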
You can use these language codes to define the target and get the expected translation:
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-ROMANCE")
batch_input_str = ["Mary spends $20 on pizza", "She likes eating it", "The pizza was great"]
# define Spanish as the target language by prepending its code
batch_input_str = ['>>es<< ' + x for x in batch_input_str]
# prepare_seq2seq_batch is deprecated in recent transformers releases
encoded = tokenizer(batch_input_str, return_tensors="pt", padding=True)
translated = model.generate(**encoded)
tokenizer.batch_decode(translated, skip_special_tokens=True)
Output:
['Mary gasta $20 en pizza', 'A ella le gusta comerlo.', 'La pizza fue genial.']
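Both directions follow the same tokenize / generate / decode pattern, so they can be tied together in a small convenience wrapper (a sketch only; translate is a hypothetical helper, not part of the transformers API):

```python
# Hypothetical helper chaining the steps shown above. tokenizer and model
# are the objects returned by from_pretrained; target_code is only needed
# for multilingual models such as opus-mt-en-ROMANCE.
def translate(sentences, tokenizer, model, target_code=None):
    if target_code is not None:
        # multilingual Marian models expect the target code as a prefix
        sentences = [f"{target_code} {s}" for s in sentences]
    encoded = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(**encoded)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

With the es-en model you would call translate(batch_input_str, tokenizer, model); with en-ROMANCE you would add target_code=">>es<<".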
Upvotes: 3