ginobimura

Reputation: 115

Any reason to save a pretrained BERT tokenizer?

Say I am using tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True), and all I am doing with that tokenizer during fine-tuning of a new model is calling the standard tokenizer.encode().

I have seen in most places that people save the tokenizer at the same time they save their model, but I am unclear on why saving it is necessary, since it seems like an out-of-the-box tokenizer that is not modified in any way during training.
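
For concreteness, the only thing I do with it looks roughly like this (the sentence is just a placeholder):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# plain, unmodified encoding of each training example
input_ids = tokenizer.encode("some training sentence", add_special_tokens=True)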

Upvotes: 2

Views: 2478

Answers (2)

Ashwin Geet D'Sa

Reputation: 7379

In your case, if you are using the tokenizer only to tokenize the text (encode()), then you do not need to save the tokenizer. You can always load the tokenizer of the pretrained model.

However, sometimes you may want to take the tokenizer of the pretrained model and then add new tokens to its vocabulary, or redefine the special symbols such as '[CLS]', '[MASK]', '[SEP]', '[PAD]', or any other special tokens. In this case, since you have made changes to the tokenizer, it is useful to save it for future use, as sketched below.
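
Here is a minimal sketch of that situation; the added tokens and the output directory are made-up examples:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# add new domain-specific tokens to the vocabulary (example tokens)
tokenizer.add_tokens(['newtoken1', 'newtoken2'])

model = BertModel.from_pretrained('bert-base-uncased')
# resize the embedding matrix to match the enlarged vocabulary
model.resize_token_embeddings(len(tokenizer))

# the tokenizer now differs from the stock one, so save it alongside the model
tokenizer.save_pretrained('./my-finetuned-bert')
model.save_pretrained('./my-finetuned-bert')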

Upvotes: 3

prosti

Reputation: 46449

You can always re-create the tokenizer with:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Saving it may just be part of people's routine; it is not strictly needed if the tokenizer is unmodified.

Upvotes: 0
