Nui

Reputation: 111

Huggingface BERT Tokenizer add new token

I am using Huggingface BERT for an NLP task. My texts contain names of companies which are split up into subwords.

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer.encode_plus("Somespecialcompany")
output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now, I would like to add those names to the tokenizer's vocabulary so they are not split up.

tokenizer.add_tokens("Somespecialcompany")
output: 1

This extends the length of the tokenizer from 30522 to 30523.

The desired output would therefore be the new ID:

tokenizer.encode_plus("Somespecialcompany")
output: 30522

But the output is the same as before:

output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

So my question is: what is the right way to add new tokens to the tokenizer so that I can use them with tokenizer.encode_plus() and tokenizer.batch_encode_plus()?

Upvotes: 6

Views: 9943

Answers (3)

alvas

Reputation: 122280

Source: https://www.depends-on-the-definition.com/how-to-add-new-tokens-to-huggingface-transformers/

from transformers import AutoTokenizer, AutoModel

# pick the model type
model_type = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_type)
model = AutoModel.from_pretrained(model_type)

# new tokens
new_tokens = ["new_token"]

# check if the tokens are already in the vocabulary
new_tokens = set(new_tokens) - set(tokenizer.vocab.keys())

# add the tokens to the tokenizer vocabulary
tokenizer.add_tokens(list(new_tokens))

# add new, random embeddings for the new tokens
model.resize_token_embeddings(len(tokenizer))
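
A quick way to verify this (a minimal sketch; the sample sentence is just for illustration):

# the newly added token should now come back as a single piece
print(tokenizer.tokenize("a sentence with new_token in it"))

# and the embedding matrix should have grown to match the new tokenizer length
print(model.get_input_embeddings().weight.shape)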

Upvotes: 1

Dan Hilgart

Reputation: 1

I'm not sure you want to add it as a special token; special tokens get extra behavior that would not be desirable here (e.g. they are dropped when decoding with skip_special_tokens=True). Try the AddedToken class with single_word=True instead:

import tokenizers
tokenizer.add_tokens(tokenizers.AddedToken("somecompanyname", single_word=True))

see here: https://huggingface.co/docs/tokenizers/v0.13.3/en/api/added-tokens
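
For completeness, a small self-contained sketch of that approach (bert-base-uncased and the company name are just placeholders):

from transformers import BertTokenizerFast
from tokenizers import AddedToken

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(AddedToken("somecompanyname", single_word=True))

ids = tokenizer.encode_plus("somecompanyname")["input_ids"]
print(ids)                                              # the new ID appears between [CLS] and [SEP]
print(tokenizer.decode(ids, skip_special_tokens=True))  # 'somecompanyname' is kept, unlike a special token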

Upvotes: 0

Nui

Reputation: 111

I opened a bug report on GitHub, and apparently I just have to set the special_tokens argument to True:

tokenizer.add_tokens(["somecompanyname"], special_tokens=True)

The name is then no longer split up and encodes to the new ID 30522.
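
A quick end-to-end check (a sketch; the surrounding IDs are the usual [CLS]/[SEP] IDs for bert-base-uncased):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["somecompanyname"], special_tokens=True)

# both single and batched encoding now keep the name as one ID
print(tokenizer.encode_plus("somecompanyname")["input_ids"])          # e.g. [101, 30522, 102]
print(tokenizer.batch_encode_plus(["somecompanyname"])["input_ids"])  # e.g. [[101, 30522, 102]]

# if a model consumes these IDs, its embeddings must be resized as well
# model.resize_token_embeddings(len(tokenizer))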

Upvotes: 5
