Rish

Reputation: 569

How to fine-tune BERT on unlabeled data?

I want to fine-tune BERT on a specific domain. I have texts from that domain in text files. How can I use them to fine-tune BERT? I am currently looking here.

My main objective is to get sentence embeddings using BERT.

Upvotes: 14

Views: 9627

Answers (4)

dennlinger

Reputation: 11488

The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.

The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need on the order of 1,000 or more labeled samples.

Pretraining, on the other hand, is basically trying to help BERT better "understand" data from a certain domain by continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
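For illustration, here is a minimal sketch of what that continued MLM pretraining could look like with Hugging Face's transformers library. The file name, model choice, and hyperparameters are placeholders, and newer library versions may prefer building the dataset with the datasets library instead of LineByLineTextDataset:

```python
# Sketch: continue BERT's masked-language-model pretraining on plain domain text.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Each line of domain_corpus.txt (placeholder path) becomes one training example.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="domain_corpus.txt",
    block_size=128,
)

# The collator randomly [MASK]s 15% of tokens, i.e. BERT's MLM objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="bert-domain-adapted",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("bert-domain-adapted")
```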

If your ultimate objective is sentence embeddings, however, I would strongly suggest you have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form of loss that is relevant for embeddings.
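Getting embeddings out of Sentence Transformers is then very short. A minimal sketch, where the model name is just a common default and not a specific recommendation:

```python
# Sketch: obtain one fixed-size vector per sentence with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "BERT produces contextual token embeddings.",
    "Sentence Transformers pools them into a single sentence vector.",
]

# encode() returns one embedding per input sentence.
embeddings = model.encode(sentences)
print(embeddings.shape)  # e.g. (2, 384) for this model
```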

Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!

Upvotes: 14

pashok3ddd

Reputation: 335

@dennlinger gave an exhaustive answer. Additional pretraining is also referred to as "post-training", "domain adaptation", and "language modeling fine-tuning". Here you will find an example of how to do it.

But since you want good sentence embeddings, you are better off using Sentence Transformers. Moreover, they provide fine-tuned models that are already capable of capturing semantic similarity between sentences. The "Continue Training on Other Data" section is what you want if you need to further fine-tune the model on your domain. You do have to prepare a training dataset according to one of the available loss functions. E.g. ContrastiveLoss requires a pair of texts and a label indicating whether the pair is similar.
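A minimal sketch of that fine-tuning step with ContrastiveLoss; the sentence pairs, model name, and hyperparameters below are placeholders you would replace with your domain data:

```python
# Sketch: further fine-tune a Sentence Transformers model with ContrastiveLoss.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each InputExample holds a pair of texts and a label: 1 = similar, 0 = dissimilar.
train_examples = [
    InputExample(texts=["example sentence A", "a paraphrase of A"], label=1),
    InputExample(texts=["example sentence A", "an unrelated sentence"], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

train_loss = losses.ContrastiveLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("domain-sentence-model")
```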

Upvotes: 2

Plabon Dutta

Reputation: 7289

You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training, and to get started, you can take a look over here.

Upvotes: 0

KaiJun

Reputation: 127

I believe transfer learning is useful for training the model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer on your own training data. However, the data would need to be labelled.

TensorFlow has a useful guide on transfer learning.
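For concreteness, a minimal sketch of this frozen-base setup, written here with PyTorch and Hugging Face transformers rather than TensorFlow; the two-class head and the single training step are assumptions for illustration:

```python
# Sketch: freeze BERT and train only a new classification layer on labeled data.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

# Freeze every BERT parameter so only the new head is trained.
for param in bert.parameters():
    param.requires_grad = False

# New trainable layer on top of the [CLS] token representation.
classifier = nn.Linear(bert.config.hidden_size, 2)

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a labeled example.
inputs = tokenizer("an example labeled sentence", return_tensors="pt")
labels = torch.tensor([1])

with torch.no_grad():  # base model stays fixed
    hidden = bert(**inputs).last_hidden_state[:, 0]  # [CLS] token vector
logits = classifier(hidden)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```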

Upvotes: 0
