Darshan

Reputation: 2333

What is the difference between the RobertaTokenizer() constructor and the from_pretrained() way of initialising RobertaTokenizer?

I am a newbie to Hugging Face transformers and am facing the following issue while training a RobertaForMaskedLM language model from scratch:

First, I have trained and saved a ByteLevelBPETokenizer as follows:

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the corpus and save vocab.json / merges.txt
tokenizer = ByteLevelBPETokenizer()
print('Saving tokenizer at:', tokenizer_mdl_dir)
tokenizer.train(files=training_file, vocab_size=VOCAB_SIZE, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
tokenizer.save_model(tokenizer_mdl_dir)

Then, I trained RobertaForMaskedLM using this tokenizer, creating a RobertaTokenizer as follows:

from transformers import RobertaTokenizer

roberta_tokenizer = RobertaTokenizer(tokenizer_mdl_dir + "/vocab.json", tokenizer_mdl_dir + "/merges.txt")

But now, when I try to test the trained LM using a fill-mask pipeline,

fill_mask_pipeline = pipeline("fill-mask", model=roberta_model, tokenizer=roberta_tokenizer)

I got the following error:

PipelineException: No mask_token (<mask>) found on the input

So I realized that the tokenizer I loaded is splitting the <mask> token into sub-tokens instead of treating it as a single special token. But I couldn't understand why it is doing so. Please help me understand this.
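Here is a minimal check that illustrates what I mean (tokenizer_mdl_dir is the directory from my training script above, and the sentence is just an example):

from transformers import RobertaTokenizer

# Tokenizer built directly from the saved vocab/merges files, as above
t = RobertaTokenizer(tokenizer_mdl_dir + "/vocab.json", tokenizer_mdl_dir + "/merges.txt")

# "<mask>" is not kept as a single special token here; it comes back split
# into ordinary sub-tokens, so the pipeline cannot find a mask token
print(t.tokenize("Paris is the <mask> of France."))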

After trying several things, I loaded the tokenizer differently,

roberta_tokenizer = RobertaTokenizer.from_pretrained(tokenizer_mdl_dir)

And now the fill_mask_pipeline runs without errors. So, what is the difference between loading a tokenizer with the RobertaTokenizer() constructor and with the .from_pretrained() method?

Upvotes: 1

Views: 1721

Answers (1)

cronoik

Reputation: 19450

When you compare the unique_no_split_tokens property, you will see that it is initialized for the from_pretrained tokenizer but not for the one created with the constructor:

# from_pretrained
t1.unique_no_split_tokens
# ['</s>', '<mask>', '<pad>', '<s>', '<unk>']

# __init__
t2.unique_no_split_tokens
# []

This property is filled by _add_tokens(), which is called by from_pretrained but not by __init__. I'm actually not sure if this is a bug or a feature. from_pretrained is the recommended method to initialize a tokenizer from a pretrained tokenizer and should therefore be used.
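As a quick sanity check (a minimal sketch; tokenizer_mdl_dir stands in for the directory that contains your vocab.json and merges.txt), you can see the practical consequence by tokenizing a masked sentence with both tokenizers:

from transformers import RobertaTokenizer

# Recommended: special tokens end up in unique_no_split_tokens
t1 = RobertaTokenizer.from_pretrained(tokenizer_mdl_dir)

# Plain constructor: unique_no_split_tokens stays empty
t2 = RobertaTokenizer(tokenizer_mdl_dir + "/vocab.json", tokenizer_mdl_dir + "/merges.txt")

text = "Paris is the <mask> of France."

# t1 keeps "<mask>" as one token, so the fill-mask pipeline can find it
print(t1.tokenize(text))

# t2 splits "<mask>" into ordinary sub-tokens, which is what triggers the
# "No mask_token (<mask>) found on the input" exception
print(t2.tokenize(text))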

Upvotes: 1
