Reputation: 306
I'm following this tutorial (https://colab.research.google.com/github/tensorflow/text/blob/master/docs/guide/subwords_tokenizer.ipynb#scrollTo=kh98DvoDz7Jn) to generate a vocabulary from a custom dataset. In the tutorial, it takes around 2 minutes for this code to complete:
bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size = 8000,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={},
)

pt_vocab = bert_vocab.bert_vocab_from_dataset(
    train_pt.batch(1000).prefetch(2),
    **bert_vocab_args
)
On my dataset it takes a lot longer. I tried increasing the batch size as well as decreasing the vocabulary size, but neither helped. Is there any way to make this go faster?
Upvotes: 0
Views: 416
Reputation: 11
I ran into the same issue. This is how I resolved it:
First I checked the number of elements in the dataset:
examples, metadata = tfds.load('my_dataset', as_supervised=True, with_info=True)
print(metadata)
In my case, the dataset contained more than 5 million elements, which explains why creating the vocabulary took an endless amount of time.
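If your data is a plain `tf.data.Dataset` rather than a tfds build with metadata, you can count the elements directly. A minimal sketch, using the `train_pt` dataset from the question (the fallback iterates the whole dataset once, so it can itself be slow on very large data):

import tensorflow as tf

# Fast path: some datasets already know their own size
n = train_pt.cardinality()
if n == tf.data.UNKNOWN_CARDINALITY:
    # Fall back to counting by iterating once over the dataset
    n = train_pt.reduce(tf.constant(0, tf.int64), lambda count, _: count + 1)
print(int(n), 'elements')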
The Portuguese vocabulary in the TensorFlow example is built from roughly 50,000 elements, so I selected 1% of my dataset:
train_tokenize, metadata = tfds.load('my_dataset', split='train[:1%]', as_supervised=True, with_info=True)
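If the dataset does not come from tfds, the `split='train[:1%]'` syntax is not available, but a similar effect can be had with `Dataset.take()`. A rough sketch, assuming the full dataset is named `full_dataset` and you want about 50,000 elements:

# Sample a fixed number of elements instead of a percentage slice
# (shuffle first so the sample is not just the start of the dataset)
train_tokenize = full_dataset.shuffle(buffer_size=100_000).take(50_000)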
I then used this dataset to develop the vocabulary, which took some 2 minutes:
train_en_tokenize = train_tokenize.map(lambda en, ol: en)
train_ol_tokenize = train_tokenize.map(lambda en, ol: ol)

ol_vocab = bert_vocab.bert_vocab_from_dataset(
    train_ol_tokenize.batch(1000).prefetch(2),
    **bert_vocab_args
)
en_vocab = bert_vocab.bert_vocab_from_dataset(
    train_en_tokenize.batch(1000).prefetch(2),
    **bert_vocab_args
)
where `ol` stands for the 'other language' I am developing the model for.
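As in the linked tutorial, once the vocabularies are built you can write them to disk and pass the file paths to `text.BertTokenizer`. A sketch assuming `import tensorflow_text as text` and the `bert_tokenizer_params` from the question (the file names are arbitrary):

def write_vocab_file(filepath, vocab):
    with open(filepath, 'w') as f:
        for token in vocab:
            print(token, file=f)

write_vocab_file('en_vocab.txt', en_vocab)
write_vocab_file('ol_vocab.txt', ol_vocab)

en_tokenizer = text.BertTokenizer('en_vocab.txt', **bert_tokenizer_params)
ol_tokenizer = text.BertTokenizer('ol_vocab.txt', **bert_tokenizer_params)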
Upvotes: 1