Reputation: 2333
I am new to Hugging Face transformers and am facing the issue below while training a RobertaForMaskedLM from scratch:
First, I trained and saved a ByteLevelBPETokenizer as follows:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=training_file, vocab_size=VOCAB_SIZE, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
print('Saving tokenizer at:', tokenizer_mdl_dir)
tokenizer.save_model(tokenizer_mdl_dir)
Then I trained a RobertaForMaskedLM using this tokenizer, creating the RobertaTokenizer as follows:
from transformers import RobertaTokenizer

roberta_tokenizer = RobertaTokenizer(tokenizer_mdl + "/vocab.json", tokenizer_mdl + "/merges.txt")
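For completeness, the training step itself followed the usual Trainer recipe, roughly like the sketch below; the model size, hyperparameters, and output directory here are illustrative placeholders, not the exact values I used:
from transformers import (RobertaConfig, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset,
                          Trainer, TrainingArguments)

# Config sized to match the trained tokenizer's vocabulary
config = RobertaConfig(vocab_size=VOCAB_SIZE, max_position_embeddings=514,
                       num_attention_heads=12, num_hidden_layers=6)
roberta_model = RobertaForMaskedLM(config=config)

# Masked-language-modeling data: 15% of tokens are masked
dataset = LineByLineTextDataset(tokenizer=roberta_tokenizer,
                                file_path=training_file, block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=roberta_tokenizer,
                                                mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(output_dir="./roberta_lm",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=16)
trainer = Trainer(model=roberta_model, args=training_args,
                  data_collator=data_collator, train_dataset=dataset)
trainer.train()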
But now, when I try to test the trained LM with a fill-mask pipeline,
from transformers import pipeline

fill_mask_pipeline = pipeline("fill-mask", model=roberta_model, tokenizer=roberta_tokenizer)
I get the error below:
PipelineException: No mask_token (<mask>) found on the input
So I realized that the tokenizer I loaded is splitting the <mask>
token as well, but I couldn't understand why it does so. Please help me understand this.
After trying several things, I loaded the tokenizer differently:
roberta_tokenizer = RobertaTokenizer.from_pretrained(tokenizer_mdl)
Now the fill_mask_pipeline runs without errors. So, what is the difference between loading a tokenizer with RobertaTokenizer() and with the .from_pretrained() method?
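For example, a call like this (the sentence is just an illustration) now returns predictions instead of raising the exception:
# Each prediction is a dict with 'sequence', 'score', 'token', 'token_str'
predictions = fill_mask_pipeline("The quick brown <mask> jumps over the lazy dog.")
for p in predictions:
    print(p["token_str"], p["score"])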
Upvotes: 1
Views: 1721
Reputation: 19450
When you compare the property unique_no_split_tokens, you will see that it is initialized for the from_pretrained tokenizer but not for the other:
# from_pretrained
t1.unique_no_split_tokens
['</s>', '<mask>', '<pad>', '<s>', '<unk>']

# __init__
t2.unique_no_split_tokens
[]
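(Here t1 and t2 are built from the same vocab/merges files, just via the two code paths from your question:)
from transformers import RobertaTokenizer

# from_pretrained: reads the directory and registers the special tokens
t1 = RobertaTokenizer.from_pretrained(tokenizer_mdl)

# __init__: builds the tokenizer from the raw files only
t2 = RobertaTokenizer(tokenizer_mdl + "/vocab.json", tokenizer_mdl + "/merges.txt")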
This property is filled by _add_tokens(), which is called by from_pretrained but not by __init__. I'm actually not sure whether this is a bug or a feature, but from_pretrained is the recommended way to initialize a tokenizer from pretrained files and should therefore be used.
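You can see the practical effect by tokenizing a string that contains the mask token (the exact subword pieces t2 produces depend on your trained vocabulary):
text = "Paris is the <mask> of France."

# t1 keeps <mask> as one piece because it is in unique_no_split_tokens
print(t1.tokenize(text))

# t2 splits <mask> into ordinary BPE pieces, so the fill-mask
# pipeline cannot find the mask token in the encoded input
print(t2.tokenize(text))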
Upvotes: 1