Reputation: 151
I use the tokenizers library to train a Tokenizer and save the model like this:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace, ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = ByteLevelDecoder()
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(files=["/content/drive/MyDrive/Work/NLP/bert_practice/data/doc.txt"], trainer=trainer)
tokenizer.model.save('/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer')
['/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/vocab.json',
'/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/merges.txt']
and it works well:
tokenizer.encode("东风日产2021款劲客正式上市").tokens
['东风日产', '2021款', '劲客', '正式上市']
but when I load the model with transformers' BertTokenizer like this:
from transformers import BertTokenizer
tokenizer = BertTokenizer(
    vocab_file="/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/vocab.json",
    # merges_file="/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/merges.txt",
)
it always predicts '[UNK]' as follows:
tokenizer.tokenize("奥迪A5有着年轻时尚的外观,动力强、操控也很棒")
['[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]',
 '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]']
Could anyone figure out the problem? Any suggestion to solve this would be very helpful.
Upvotes: 1
Views: 873
Reputation: 2338
You are trying to load a BPE-based tokenizer into BertTokenizer, but BertTokenizer does not use BPE; it uses a WordPiece tokenizer. So there is an incompatibility. See this page from the Hugging Face documentation: https://huggingface.co/transformers/tokenizer_summary.html#wordpiece
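As a minimal sketch of one workaround (an assumption, not part of the original answer): since the tokenizer was trained with the tokenizers library, you can wrap it in PreTrainedTokenizerFast, which accepts an arbitrary tokenizers backend, instead of BertTokenizer. The paths are reused from the question; if a BERT-compatible tokenizer is really required, the model would need to be retrained with WordPiece instead of BPE.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from transformers import PreTrainedTokenizerFast

# Paths reused from the question.
vocab = "/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/vocab.json"
merges = "/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/merges.txt"

# vocab.json/merges.txt only store the BPE model, so the pre-tokenizer and
# decoder have to be set again to match the training setup.
tokenizer = Tokenizer(BPE.from_file(vocab, merges))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = ByteLevelDecoder()

# Wrap the backend tokenizer so it exposes the usual transformers interface.
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="[UNK]")
print(fast_tokenizer.tokenize("东风日产2021款劲客正式上市"))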
Upvotes: 1