Reputation: 1001
I want to apply the RoBERTa model for text similarity. Given a pair of sentences, the input should be in the format <s> A </s></s> B </s>. I have figured out two possible ways to generate the input ids, namely:
a)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands')  # includes <s> ... </s>
list2 = tokenizer.encode('Numbness of upper limb')  # includes <s> ... </s>
sequence = list1 + [2] + list2[1:]  # append an extra </s>, drop the leading <s> of list2
In this case, sequence is [0, 12178, 3814, 2400, 11, 1420, 2, 2, 234, 4179, 1825, 9, 2853, 29654, 2]
b)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)
sequence = [0] + list1 + [2, 2] + list2 + [2]  # add <s>, </s></s> and the final </s> manually
In this case, sequence is [0, 25101, 3814, 2400, 11, 1420, 2, 2, 487, 4179, 1825, 9, 2853, 29654, 2]
Here 0 represents the <s> token and 2 represents the </s> token. I'm not sure which is the correct way to encode the given two sentences for calculating sentence similarity using the RoBERTa model.
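For reference, the special-token placement can be checked by mapping the ids back to token strings with the tokenizer's standard convert_ids_to_tokens method:
tokens = tokenizer.convert_ids_to_tokens(sequence)
print(tokens)  # prints the token strings, with '<s>' and '</s>' in place of 0 and 2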
Upvotes: 1
Views: 1824
Reputation: 11420
The easiest way is probably to use the function provided by HuggingFace's tokenizers directly, namely the text_pair argument of the encode function (see the documentation). This allows you to feed in two sentences directly, which will give you the desired output:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
sequence = tokenizer.encode(text='Very severe pain in hands',
                            text_pair='Numbness of upper limb',
                            add_special_tokens=True)
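You can verify that this produces the desired <s> A </s></s> B </s> layout by decoding the ids back into a string (a quick check using the standard decode method):
print(tokenizer.decode(sequence))
# roughly: '<s>Very severe pain in hands</s></s>Numbness of upper limb</s>'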
Using encode with text_pair is especially convenient if you are dealing with very long sequences, as the encode function automatically reduces your lengths according to the truncation_strategy argument. You obviously don't have to worry about this if you only have short sequences.
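For example, you can cap the total length of the pair like so (a sketch; in recent transformers versions the relevant arguments are max_length and truncation, while older releases used truncation_strategy):
sequence = tokenizer.encode(text='Very severe pain in hands',
                            text_pair='Numbness of upper limb',
                            add_special_tokens=True,
                            max_length=12,  # cap on the total number of ids, special tokens included
                            truncation=True)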
Alternatively, you can also make use of the more explicit build_inputs_with_special_tokens() function of the RobertaTokenizer specifically, which could be added to your example like so:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)
sequence = tokenizer.build_inputs_with_special_tokens(list1, list2)  # inserts <s>, </s></s> and the final </s>
Note that in this case, you still have to generate the sequences list1 and list2 without any special tokens, as you have already done correctly.
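Once you have the encoded pair, you can pass it through the model to obtain a joint representation; here is a minimal sketch, assuming you take the hidden state of the <s> token as the pair representation (one common choice, though for good similarity scores the model usually needs to be fine-tuned on a sentence-pair task):
import torch
model = AutoModel.from_pretrained('roberta-base')
input_ids = torch.tensor([sequence])  # batch of size one
with torch.no_grad():
    outputs = model(input_ids)
pair_repr = outputs[0][:, 0, :]  # hidden state of <s>, shape (1, 768)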
Upvotes: 5