Reputation: 43
I want to continue pretraining a BERT model on more data (not fine-tuning; the base model to be trained is 'bert-base-uncased'). However, do I always need to create my own tokenizer for a model? When I use the 'bert-base-uncased' tokenizer to train the model, it gives me this error:
Traceback (most recent call last):
File "log.py", line 10, in <module>
print(model(**input_idx))
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 989, in forward
embedding_output = self.embeddings(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 214, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2044, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
So does the model need its own tokenizer, trained on the same data?
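For reference, here is a minimal sketch of the kind of mismatch that seems to produce this error (not my exact code; the vocab_size value is a placeholder, the point is that the model's embedding matrix is smaller than the tokenizer's vocabulary):

from transformers import BertConfig, BertModel, BertTokenizer

# Pretrained tokenizer: 30522 tokens in its vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Model built from a config whose vocab_size is smaller than the tokenizer's.
config = BertConfig(vocab_size=1000)   # placeholder value
model = BertModel(config)

input_idx = tokenizer("hello world", return_tensors="pt")
# Any token id >= config.vocab_size has no row in the embedding matrix,
# so the lookup fails with: IndexError: index out of range in self
print(model(**input_idx))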
Upvotes: 1
Views: 1236
Reputation: 647
I recommend that you resize the embedding matrix to match the size of the tokenizer you want to use:
model.resize_token_embeddings(len(tokenizer))
Hugging Face docs: https://huggingface.co/docs/transformers/master/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings
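A minimal sketch of how that fits together, assuming your tokenizer lives at "./my-tokenizer" (placeholder path, substitute whichever tokenizer you actually plan to train with):

from transformers import AutoTokenizer, BertModel

# Placeholder path: the tokenizer you intend to train with.
tokenizer = AutoTokenizer.from_pretrained("./my-tokenizer")
model = BertModel.from_pretrained("bert-base-uncased")

# Grow (or shrink) the word-embedding matrix so that every token id
# the tokenizer can produce has a corresponding embedding row.
model.resize_token_embeddings(len(tokenizer))

inputs = tokenizer("some example text", return_tensors="pt")
outputs = model(**inputs)   # no more IndexError

Note that any newly added embedding rows are freshly initialized, so they only become meaningful once you continue training the model on your data.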
Upvotes: 1