Jack.Sparrow

Reputation: 151

How can I use BertTokenizer to load a trained tokenizer model?

I used the tokenizers library to train a tokenizer and saved the model like this:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace, ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = ByteLevelDecoder()
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())

tokenizer.train(files=["/content/drive/MyDrive/Work/NLP/bert_practice/data/doc.txt"], trainer=trainer)

tokenizer.model.save('/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer')

['/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/vocab.json',
 '/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/merges.txt']

It works well:

tokenizer.encode("东风日产2021款劲客正式上市").tokens
['东风日产', '2021款', '劲客', '正式上市']

But when I load the model with transformers' BertTokenizer like this:

from transformers import BertTokenizer

tokenizer = BertTokenizer(
    vocab_file="/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/vocab.json",
    #merges_file="/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/merges.txt",
)

it always predicts '[UNK]', as follows:

tokenizer.tokenize("奥迪A5有着年轻时尚的外观,动力强、操控也很棒")
['[UNK]', '[UNK]', '[UNK]', '[UNK]', ...]  # 22 '[UNK]' tokens in total

Could anyone figure out the problem? Any suggestion on how to solve this would be very helpful.

Upvotes: 1

Views: 873

Answers (1)

Berkay Berabi

Reputation: 2338

You are trying to load a BPE-based tokenizer into BertTokenizer, but BertTokenizer does not use BPE; it uses a WordPiece tokenizer, so the two are incompatible. See the tokenizer summary in the Hugging Face documentation: https://huggingface.co/transformers/tokenizer_summary.html#wordpiece
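One way around the incompatibility, sketched below under the assumption that you need a transformers tokenizer but not BertTokenizer specifically: keep the BPE model, save the whole tokenizer as a single JSON file, and load it through the model-agnostic PreTrainedTokenizerFast. The doc.txt corpus and file names here are illustrative stand-ins for the paths in the question.

```python
from pathlib import Path

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Toy training corpus (stand-in for the doc.txt in the question).
Path("doc.txt").write_text("东风日产2021款劲客正式上市\n" * 100, encoding="utf-8")

# Train a BPE tokenizer, this time with an explicit [UNK] token so unseen
# characters don't silently disappear.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=25000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["doc.txt"], trainer=trainer)

# Save the *whole* tokenizer (pre-tokenizer, model, special tokens) as one
# JSON file, instead of only vocab.json/merges.txt.
tokenizer.save("tokenizer.json")

# PreTrainedTokenizerFast accepts any `tokenizers` model, including BPE.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)

print(fast_tokenizer.tokenize("东风日产2021款劲客正式上市"))
```

If you specifically need BERT-style tokenization, train a WordPiece tokenizer instead (the tokenizers library ships a WordPiece model and trainer); its vocab.txt output is what BertTokenizer expects.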

Upvotes: 1
