ginobimura

Reputation: 115

Any reason to save a pretrained BERT tokenizer?

Say I am using tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True), and all I am doing with that tokenizer during fine-tuning of a new model is calling the standard tokenizer.encode().

I have seen in most places that people save the tokenizer at the same time they save their model, but I am unclear on why saving it is necessary, since it seems like an out-of-the-box tokenizer that is not modified in any way during training.
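
For concreteness, the only thing I do with it looks roughly like this (the sentence is just a placeholder):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# plain, unmodified encoding of each training example
input_ids = tokenizer.encode("some training sentence", add_special_tokens=True)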

Upvotes: 2

Views: 2478

Answers (2)

Ashwin Geet D'Sa

Reputation: 7379

In your case, if you are using the tokenizer only to tokenize the text (encode()), then you do not need to save the tokenizer. You can always load the tokenizer of the pretrained model.

However, sometimes you may want to take the tokenizer of the pretrained model and then add new tokens to its vocabulary, or redefine the special symbols such as '[CLS]', '[MASK]', '[SEP]', '[PAD]', or any other special tokens. In this case, since you have made changes to the tokenizer, it is useful to save it for future use, as sketched below.
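
Here is a minimal sketch of that situation; the added tokens and the output directory are made-up examples:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# add new domain-specific tokens to the vocabulary (example tokens)
tokenizer.add_tokens(['newtoken1', 'newtoken2'])

model = BertModel.from_pretrained('bert-base-uncased')
# resize the embedding matrix to match the enlarged vocabulary
model.resize_token_embeddings(len(tokenizer))

# the tokenizer now differs from the stock one, so save it alongside the model
tokenizer.save_pretrained('./my-finetuned-bert')
model.save_pretrained('./my-finetuned-bert')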

Upvotes: 3

prosti

Reputation: 46449

You can always re-create the tokenizer with:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Saving it may just be part of people's routine; it is not strictly needed if the tokenizer is unmodified.

Upvotes: 0
