Reputation: 43
I want to continue pretraining a BERT model on more data (not fine-tuning; the base model to be trained is 'bert-base-uncased'). However, do I always need to create my own tokenizer for a model? When I use the 'bert-base-uncased' tokenizer to train the model, it gives me this error:
Traceback (most recent call last):
File "log.py", line 10, in <module>
print(model(**input_idx))
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 989, in forward
embedding_output = self.embeddings(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 214, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2044, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
So does the model need its own tokenizer, trained on the same data?
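For reference, here is a minimal sketch of the kind of mismatch that seems to produce this error (not my exact code; the vocab_size value is a placeholder, the point is that the model's embedding matrix is smaller than the tokenizer's vocabulary):

from transformers import BertConfig, BertModel, BertTokenizer

# Pretrained tokenizer: 30522 tokens in its vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Model built from a config whose vocab_size is smaller than the tokenizer's.
config = BertConfig(vocab_size=1000)   # placeholder value
model = BertModel(config)

input_idx = tokenizer("hello world", return_tensors="pt")
# Any token id >= config.vocab_size has no row in the embedding matrix,
# so the lookup fails with: IndexError: index out of range in self
print(model(**input_idx))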
Upvotes: 1
Views: 1236
Reputation: 647
I recommend that you resize the embedding matrix to match the size of the tokenizer you want to use:
model.resize_token_embeddings(len(tokenizer))
Hugging Face docs: https://huggingface.co/docs/transformers/master/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings
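A minimal sketch of how that fits together, assuming your tokenizer lives at "./my-tokenizer" (placeholder path, substitute whichever tokenizer you actually plan to train with):

from transformers import AutoTokenizer, BertModel

# Placeholder path: the tokenizer you intend to train with.
tokenizer = AutoTokenizer.from_pretrained("./my-tokenizer")
model = BertModel.from_pretrained("bert-base-uncased")

# Grow (or shrink) the word-embedding matrix so that every token id
# the tokenizer can produce has a corresponding embedding row.
model.resize_token_embeddings(len(tokenizer))

inputs = tokenizer("some example text", return_tensors="pt")
outputs = model(**inputs)   # no more IndexError

Note that any newly added embedding rows are freshly initialized, so they only become meaningful once you continue training the model on your data.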
Upvotes: 1