Reputation: 483
I'm trying to add a few new words to the vocabulary of a pretrained HuggingFace Transformers model. I did the following to change the vocabulary of the tokenizer and also increase the embedding size of the model:
tokenizer.add_tokens(['word1', 'word2', 'word3', 'word4'])
model.resize_token_embeddings(len(tokenizer))
print(len(tokenizer)) # outputs len_vocabulary + 4
But after training the model on my corpus and saving it, I found that the saved tokenizer's vocabulary size hadn't changed. After checking again, I found that the above code does not change the vocabulary size (tokenizer.vocab_size is still the same); only len(tokenizer) has changed.
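For reference, here is a minimal, self-contained version of what I'm doing (bert-base-uncased is just a placeholder checkpoint, not the model I'm actually using):
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")

print(tokenizer.vocab_size, len(tokenizer))  # both equal the base vocabulary size

tokenizer.add_tokens(['word1', 'word2', 'word3', 'word4'])
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.vocab_size, len(tokenizer))  # vocab_size unchanged, len(tokenizer) grew by 4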
So now my question is: what is the difference between tokenizer.vocab_size and len(tokenizer)?
Upvotes: 11
Views: 11691
Reputation: 1185
From the HuggingFace docs, if you search for the method vocab_size, you can see in the docstring that it returns the size excluding the added tokens:
Size of the base vocabulary (without the added tokens).
And then, calling len() on the tokenizer object itself calls its __len__ method:
def __len__(self):
    """
    Size of the full vocabulary with the added tokens.
    """
    return self.vocab_size + len(self.added_tokens_encoder)
So you can clearly see that the former returns the size excluding the added tokens, while the latter includes them, since it is essentially the former (vocab_size) plus len(added_tokens_encoder).
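As a quick sanity check, here is a minimal sketch, assuming a slow (Python) tokenizer such as BertTokenizer, where added_tokens_encoder is exposed, and using bert-base-uncased only as a placeholder checkpoint:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
tokenizer.add_tokens(['word1', 'word2', 'word3', 'word4'])

print(tokenizer.vocab_size)                  # size of the base vocabulary only
print(len(tokenizer.added_tokens_encoder))   # tokens tracked on top of the base vocabulary
print(len(tokenizer))                        # full size, including the added tokens
In the transformers version whose __len__ is quoted above, the last value is exactly the base vocabulary size plus the number of entries in added_tokens_encoder.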
Upvotes: 16