Reputation: 33
I tried to add new words to the BERT tokenizer vocab. I can see that the length of the vocab increases, but I can't find the newly added words in the vocab.
tokenizer.add_tokens(['covid', 'wuhan'])
v = tokenizer.get_vocab()
print(len(v))
print('covid' in tokenizer.vocab)
Output:
30524
False
Upvotes: 3
Views: 3317
Reputation: 19385
tokenizer.vocab and tokenizer.get_vocab() are two different things. The first contains the base vocabulary without the added tokens, while the second contains the base vocabulary together with the added tokens.
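from transformers import BertTokenizer
t = BertTokenizer.from_pretrained('bert-base-uncased')
# Before adding anything: both views have the same size, and no tokens have been added
print(len(t.vocab))
print(len(t.get_vocab()))
print(t.get_added_vocab())
t.add_tokens(['covid'])
# After adding a token: t.vocab is unchanged, get_vocab() and get_added_vocab() include it
print(len(t.vocab))
print(len(t.get_vocab()))
print(t.get_added_vocab())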
Output:
30522
30522
{}
30522
30523
{'covid': 30522}
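So to check whether a token was added, look it up in get_vocab() (or get_added_vocab()) rather than in tokenizer.vocab. Here is a minimal sketch of that check. To avoid downloading a checkpoint it builds a BertTokenizer from a tiny made-up vocab file, so the toy_vocab list and the temp-file path are assumptions for illustration only, and the exact contents of get_added_vocab() may differ between transformers versions.

```python
import os
import tempfile

from transformers import BertTokenizer

# Hypothetical toy vocabulary, written to a temp file so nothing is downloaded.
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world"]

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_vocab) + "\n")
    # The vocab is read into memory here, so the file can be deleted afterwards.
    tok = BertTokenizer(vocab_file=vocab_path)

# add_tokens returns how many tokens were actually new.
num_added = tok.add_tokens(["covid"])

# get_vocab() merges the base vocabulary with the added tokens.
full_vocab = tok.get_vocab()
covid_id = tok.convert_tokens_to_ids("covid")

print(num_added)                          # number of newly added tokens
print("covid" in full_vocab)              # membership check that sees added tokens
print("covid" in tok.get_added_vocab())   # added tokens only
print(covid_id == full_vocab["covid"])    # id lookup agrees with get_vocab()
print(len(tok))                           # total size, including added tokens
```

With a real checkpoint the same checks apply to the tokenizer returned by from_pretrained; only the construction step changes.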
Upvotes: 3