Mr. NLP

Reputation: 1001

Confusion in Pre-processing text for Roberta Model

I want to apply the RoBERTa model for text similarity. Given a pair of sentences, the input should be in the format <s> A </s></s> B </s>. I have figured out two possible ways to generate the input IDs, namely:

a)

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

list1 = tokenizer.encode('Very severe pain in hands')

list2 = tokenizer.encode('Numbness of upper limb')

sequence = list1+[2]+list2[1:]

In this case, sequence is [0, 12178, 3814, 2400, 11, 1420, 2, 2, 234, 4179, 1825, 9, 2853, 29654, 2]

b)

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)

list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)

sequence = [0]+list1+[2,2]+list2+[2]

In this case, sequence is [0, 25101, 3814, 2400, 11, 1420, 2, 2, 487, 4179, 1825, 9, 2853, 29654, 2]

Here 0 represents the <s> token and 2 represents the </s> token. I'm not sure which is the correct way to encode the two sentences for calculating sentence similarity with the RoBERTa model.
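
As a side note, these IDs don't have to be hard-coded; a tokenizer loaded from transformers exposes its special tokens as attributes, which makes such checks less error-prone. A minimal sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

# RoBERTa uses <s> as its BOS/CLS token and </s> as its EOS/SEP token
print(tokenizer.bos_token_id)  # 0, the <s> token
print(tokenizer.sep_token_id)  # 2, the </s> token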

Upvotes: 1

Views: 1824

Answers (1)

dennlinger

Reputation: 11420

The easiest way is probably to use the functionality that HuggingFace's tokenizers provide directly, namely the text_pair argument of the encode function (see the documentation). This allows you to feed in two sentences at once, which will give you the desired output:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
sequence = tokenizer.encode(text='Very severe pain in hands',
                            text_pair='Numbness of upper limb',
                            add_special_tokens=True)

This is especially convenient if you are dealing with very long sequences, as the encode function can automatically truncate them according to the truncation_strategy argument. You obviously don't have to worry about this for short sequences like these.
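
For instance, in recent transformers versions the relevant arguments are spelled truncation and max_length (a sketch; the cutoff of 16 tokens is purely for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

# Truncate the encoded pair to at most 16 tokens; by default the
# 'longest_first' strategy trims the longer sentence first
sequence = tokenizer.encode(text='Very severe pain in hands',
                            text_pair='Numbness of upper limb',
                            add_special_tokens=True,
                            truncation=True,
                            max_length=16)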

Alternatively, you can make use of the more explicit build_inputs_with_special_tokens() function of the RobertaTokenizer, which could be added to your example like so:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)

sequence = tokenizer.build_inputs_with_special_tokens(list1, list2)

Note that in this case you still have to generate list1 and list2 without any special tokens, as you have already done correctly.
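
Either way, a quick sanity check is to decode the IDs back to text and confirm they follow the desired <s> A </s></s> B </s> layout (assuming the tokenizer and sequence from the snippets above):

print(tokenizer.decode(sequence))
# prints roughly: <s>Very severe pain in hands</s></s>Numbness of upper limb</s>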

Upvotes: 5
