Reputation: 111
I am using Huggingface BERT for an NLP task. My texts contain names of companies which are split up into subwords.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer.encode_plus("Somespecialcompany")
output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Now, I would like to add those names to the tokenizer vocabulary so they are not split up.
tokenizer.add_tokens("Somespecialcompany")
output: 1
This extends the length of the tokenizer from 30522 to 30523.
The desired output would therefore contain the new ID:
tokenizer.encode_plus("Somespecialcompany")
output: 30522
But the output is the same as before:
output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
So my question is: what is the right way of adding new tokens to the tokenizer so that I can use them with tokenizer.encode_plus() and tokenizer.batch_encode_plus()?
Upvotes: 6
Views: 9943
Reputation: 122280
Source: https://www.depends-on-the-definition.com/how-to-add-new-tokens-to-huggingface-transformers/
from transformers import AutoTokenizer, AutoModel
# pick the model type
model_type = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_type)
model = AutoModel.from_pretrained(model_type)
# new tokens
new_tokens = ["new_token"]
# check if the tokens are already in the vocabulary
new_tokens = set(new_tokens) - set(tokenizer.vocab.keys())
# add the tokens to the tokenizer vocabulary
tokenizer.add_tokens(list(new_tokens))
# add new, random embeddings for the new tokens
model.resize_token_embeddings(len(tokenizer))
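Applied to the question's setup, a minimal end-to-end check could look like this (bert-base-uncased and the lowercased company name are my assumptions; lowercasing the added token avoids any interaction with the uncased tokenizer's normalization):
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# add the token only if it is not already in the vocabulary
num_added = tokenizer.add_tokens(["somespecialcompany"])

# grow the embedding matrix so the new ID gets a (randomly initialized) row
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

# the name should now be encoded as a single ID (30522 here) between [CLS] and [SEP]
print(tokenizer.encode_plus("somespecialcompany")["input_ids"])
Note that the new embedding row is random, so the model typically needs some fine-tuning before the added token carries useful information.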
Upvotes: 1
Reputation: 1
I'm not sure you want to add it as a special token; special tokens have other behavior that would not be desirable here (e.g. they are dropped when decoding with skip_special_tokens=True). Try using the AddedToken class with single_word=True instead:
import tokenizers
tokenizer.add_tokens(tokenizers.AddedToken("somecompanyname", single_word=True))
see here: https://huggingface.co/docs/tokenizers/v0.13.3/en/api/added-tokens
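A self-contained sketch of this suggestion, assuming the asker's bert-base-uncased tokenizer (the company name is just a placeholder):
from tokenizers import AddedToken
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# single_word=True keeps the token from matching inside a longer word
tokenizer.add_tokens(AddedToken("somecompanyname", single_word=True))

# regular added tokens survive decoding, unlike special tokens with skip_special_tokens=True
ids = tokenizer.encode_plus("somecompanyname")["input_ids"]
print(ids)
print(tokenizer.decode(ids, skip_special_tokens=True))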
Upvotes: 0