ShaoMin Liu

Reputation: 133

What is so special about special tokens?

What exactly is the difference between a "token" and a "special token"?

I understand the following:

What I don't understand is: in what circumstances would you want to create a new special token? What would we need it for, and when would we want to create one beyond the default special tokens? If an example uses a special token, why can't a normal token achieve the same objective?

tokenizer.add_tokens(['[EOT]'], special_tokens=True)

I also don't quite understand the following description from the documentation. What difference does it make to our model if we set add_special_tokens to False?

add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.

Upvotes: 12

Views: 8090

Answers (1)

cronoik

Reputation: 19310

Special tokens are called special because they are not derived from your input. They are added for a certain purpose and are independent of the specific input.

What I don't understand is: in what circumstances would you want to create a new special token? What would we need it for, and when would we want to create one beyond the default special tokens?

Just as an example: in extractive conversational question answering, it is not unusual to add the question and answer of the previous dialog turn to your input to give the model some context. Those previous dialog turns are separated from the current question with special tokens. Sometimes people reuse the model's separator token, and sometimes they introduce new special tokens. The following is an example with a new special token [Q]:

# first dialog turn - no conversation history
[CLS] current question [SEP] text [EOS]
# second dialog turn - with the previous question to provide some context
[CLS] previous question [Q] current question [SEP] text [EOS]
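As a minimal sketch (assuming a BERT checkpoint, which the example above does not specify), a new special token like [Q] could be registered so that the tokenizer never splits it and the model gets an embedding slot for it:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# register [Q] as an additional special token
tokenizer.add_special_tokens({"additional_special_tokens": ["[Q]"]})
# grow the embedding matrix so the new token gets its own (trainable) vector
model.resize_token_embeddings(len(tokenizer))

# [Q] is now kept as a single token instead of being split into sub-words
print(tokenizer.tokenize("previous question [Q] current question"))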

I also don't quite understand the following description from the documentation. What difference does it make to our model if we set add_special_tokens to False?

from transformers import RobertaTokenizer
t = RobertaTokenizer.from_pretrained("roberta-base")

t("this is an example")
#{'input_ids': [0, 9226, 16, 41, 1246, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}

t("this is an example", add_special_tokens=False)
#{'input_ids': [9226, 16, 41, 1246], 'attention_mask': [1, 1, 1, 1]}

As you can see here, the second input is missing two tokens (the special tokens). Those special tokens carry meaning for your model because it was trained with them. The last_hidden_state will be different due to the lack of those two tokens and will therefore lead to a different result for your downstream task.
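As a rough sketch of that effect (reusing the roberta-base checkpoint from above), you can compare the hidden states of the same text encoded with and without the special tokens:

import torch
from transformers import RobertaTokenizer, RobertaModel

t = RobertaTokenizer.from_pretrained("roberta-base")
m = RobertaModel.from_pretrained("roberta-base")

with torch.no_grad():
    with_special = m(**t("this is an example", return_tensors="pt")).last_hidden_state
    without_special = m(**t("this is an example", add_special_tokens=False, return_tensors="pt")).last_hidden_state

# the shapes differ by the two missing tokens, and the representations of the
# remaining tokens differ as well, because attention now sees a different sequence
print(with_special.shape, without_special.shape)
#torch.Size([1, 6, 768]) torch.Size([1, 4, 768])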

Some tasks, like sequence classification, often use the [CLS] token to make their predictions. When you remove it, a model that was pre-trained with a [CLS] token will struggle.
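To make that concrete, here is a minimal, untrained sketch (a hypothetical two-class head, not a setup taken from the answer) of how a classification head typically reads the hidden state at the [CLS] position:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(model.config.hidden_size, 2)  # hypothetical 2-class head

inputs = tokenizer("this is an example", return_tensors="pt")
with torch.no_grad():
    cls_vector = model(**inputs).last_hidden_state[:, 0]  # hidden state of [CLS]

print(classifier(cls_vector).shape)
#torch.Size([1, 2])

With add_special_tokens=False, the first position would just be the first word-piece of the input, which the pre-trained model never learned to use as a summary of the whole sequence.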

Upvotes: 12
