Takumi

Reputation: 43

About get_special_tokens_mask in huggingface-transformers

I am using a transformers tokenizer and created a mask with the get_special_tokens_mask API.
My code:
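
Roughly what I am doing (a minimal sketch, not my exact script; roberta-base stands in here for the checkpoint I actually use):

import transformers

# Minimal reproduction: encode a sentence, then ask for the special-tokens mask.
tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")
ids = tokenizer.encode("some example sentence")
mask = tokenizer.get_special_tokens_mask(ids)
print(mask)  # the special-token positions come back as 1, not 0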

According to the RoBERTa documentation, this API returns "A list of integers in the range [0, 1]: 0 for a special token, 1 for a sequence token". But it seems to me that the API actually returns 0 for a sequence token and 1 for a special token.
Is the documentation wrong, or am I misunderstanding something?

Upvotes: 4

Views: 2691

Answers (1)

dennlinger

Reputation: 11488

You are indeed correct. I tested this with both transformers 2.7 and the (at the time of writing) current release, 2.9, and in both cases I get the inverted result (0 for regular tokens and 1 for special tokens).

For reference, this is how I tested it:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")
sentence = "This is a special sentence."

encoded_sentence = tokenizer.encode(sentence)
# [0, 152, 16, 10, 780, 3645, 4, 2]
special_masks = tokenizer.get_special_tokens_mask(encoded_sentence)
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# Note the mask has 10 entries for 8 ids: with the default
# already_has_special_tokens=False, the mask describes the sequence
# *after* <s> and </s> would be added around the passed ids.
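
As a side note: since encode() already inserted <s> and </s>, you can pass already_has_special_tokens=True to get a mask aligned with the ids, and until the documentation (or the behavior) is fixed, you can flip that mask yourself to recover the documented convention. A quick sketch, continuing from the snippet above:

aligned = tokenizer.get_special_tokens_mask(
    encoded_sentence, already_has_special_tokens=True
)
# expected: [1, 0, 0, 0, 0, 0, 0, 1]  (one entry per id)
documented = [1 - m for m in aligned]
# expected: [0, 1, 1, 1, 1, 1, 1, 0]  (1 for sequence tokens, as documented)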

I would suggest you report this issue in their repository, or ideally open a pull request yourself to fix it ;-)

Upvotes: 2
