Darshan

Reputation: 2333

What is the difference between the RobertaTokenizer() constructor and the from_pretrained() way of initialising RobertaTokenizer?

I am a newbie to Hugging Face transformers and am facing the following issue while training a RobertaForMaskedLM language model from scratch:

First, I have trained and saved a ByteLevelBPETokenizer as follows:

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the corpus and save vocab.json / merges.txt
tokenizer = ByteLevelBPETokenizer()
print('Saving tokenizer at:', tokenizer_mdl_dir)
tokenizer.train(files=training_file, vocab_size=VOCAB_SIZE, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
tokenizer.save_model(tokenizer_mdl_dir)

Then, I trained RobertaForMaskedLM using this tokenizer, creating a RobertaTokenizer as follows:

from transformers import RobertaTokenizer

roberta_tokenizer = RobertaTokenizer(tokenizer_mdl_dir + "/vocab.json", tokenizer_mdl_dir + "/merges.txt")

But now, when I try to test the trained LM using a fill-mask pipeline,

fill_mask_pipeline = pipeline("fill-mask", model=roberta_model, tokenizer=roberta_tokenizer)

I got the following error:

PipelineException: No mask_token (<mask>) found on the input

So I realized that the tokenizer I loaded is splitting the <mask> token into sub-tokens instead of treating it as a single special token. But I couldn't understand why it is doing so. Please help me understand this.
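Here is a minimal check that illustrates what I mean (tokenizer_mdl_dir is the directory from my training script above, and the sentence is just an example):

from transformers import RobertaTokenizer

# Tokenizer built directly from the saved vocab/merges files, as above
t = RobertaTokenizer(tokenizer_mdl_dir + "/vocab.json", tokenizer_mdl_dir + "/merges.txt")

# "<mask>" is not kept as a single special token here; it comes back split
# into ordinary sub-tokens, so the pipeline cannot find a mask token
print(t.tokenize("Paris is the <mask> of France."))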

After trying several things, I loaded the tokenizer differently,

roberta_tokenizer = RobertaTokenizer.from_pretrained(tokenizer_mdl_dir)

And now the fill_mask_pipeline runs without errors. So, what is the difference between loading a tokenizer with the RobertaTokenizer() constructor and with the .from_pretrained() method?

Upvotes: 1

Views: 1721

Answers (1)

cronoik

Reputation: 19450

When you compare the unique_no_split_tokens property, you will see that it is initialized for the from_pretrained tokenizer but not for the one created with the constructor:

# from_pretrained
t1.unique_no_split_tokens
# ['</s>', '<mask>', '<pad>', '<s>', '<unk>']

# __init__
t2.unique_no_split_tokens
# []

This property is filled by _add_tokens(), which is called by from_pretrained but not by __init__. I'm actually not sure if this is a bug or a feature. from_pretrained is the recommended method to initialize a tokenizer from a pretrained tokenizer and should therefore be used.
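As a quick sanity check (a minimal sketch; tokenizer_mdl_dir stands in for the directory that contains your vocab.json and merges.txt), you can see the practical consequence by tokenizing a masked sentence with both tokenizers:

from transformers import RobertaTokenizer

# Recommended: special tokens end up in unique_no_split_tokens
t1 = RobertaTokenizer.from_pretrained(tokenizer_mdl_dir)

# Plain constructor: unique_no_split_tokens stays empty
t2 = RobertaTokenizer(tokenizer_mdl_dir + "/vocab.json", tokenizer_mdl_dir + "/merges.txt")

text = "Paris is the <mask> of France."

# t1 keeps "<mask>" as one token, so the fill-mask pipeline can find it
print(t1.tokenize(text))

# t2 splits "<mask>" into ordinary sub-tokens, which is what triggers the
# "No mask_token (<mask>) found on the input" exception
print(t2.tokenize(text))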

Upvotes: 1
