Reputation: 31
I have my own corpus of plain text. I want to train a BERT model in TensorFlow, similar to gensim's word2vec, so that I can get the embedding vectors for each word.
All the examples I have found are related to downstream NLP tasks like classification. Instead, I want to pre-train a BERT model on my custom corpus and then extract the embedding vector for a given word.
Any lead will be helpful.
Upvotes: 2
Views: 1398
Reputation: 538
If you have access to the required hardware, you can dig into NVIDIA's training scripts for BERT using TensorFlow. The repo is here. From the Medium article:
BERT-large can be pre-trained in 3.3 days on four DGX-2H nodes (a total of 64 Volta GPUs).
If you don't have an enormous corpus, you will probably get better results by fine-tuning an available pretrained model. If you would like to do so, you can look into Hugging Face's transformers library.
Upvotes: 1