Reputation: 135
The tokenizer's add_tokens method is not adding new tokens. Here is my code:
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

new_tokens = []
text = open("parsed_data.txt", "r")
for line in text:
    for word in line.split():
        new_tokens.append(word)

print(len(new_tokens))      # 53966
print(len(bert_tokenizer))  # 36369
bert_tokenizer.add_tokens(new_tokens)
print(len(bert_tokenizer))  # 36369
Upvotes: 0
Views: 526
Reputation: 2348
Yes, this is expected: add_tokens skips any token that already exists in the vocabulary, so an unchanged count means every word in your list was already known to the tokenizer (your list of 53966 words also contains many duplicates, which are likewise skipped). By the way, after changing the tokenizer you also have to resize your model's embedding matrix; see the last line below.
from transformers import BertModel, BertTokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')  # or whatever model you are fine-tuning
bert_tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(bert_tokenizer))
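To see what is happening in your case, deduplicate the word list and check the return value of add_tokens, which reports how many tokens were actually added (tokens already in the vocabulary, and repeats within the list, contribute nothing). A minimal sketch, assuming your parsed_data.txt:

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Collect each word once; your 53966 words include many repeats.
with open("parsed_data.txt", "r") as f:
    unique_words = sorted({word for line in f for word in line.split()})

# add_tokens returns the number of tokens that were genuinely new.
num_added = bert_tokenizer.add_tokens(unique_words)
print(num_added, "of", len(unique_words), "unique words were new")
print(len(bert_tokenizer))  # grows by exactly num_added

If num_added prints 0, every word in the file was already in the vocabulary, which would explain the unchanged length you observed.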
Upvotes: 2