Reputation: 11
I'm new to GPT-2 fine-tuning. My goal is to fine-tune GPT-2 (or BERT) on my own set of documents, so that I can query the bot about a topic contained in these documents and receive an answer. I have some doubts about how to approach this, because I have seen that fine-tuning a question-answering chatbot requires a labelled dataset in which questions are paired with answers.
Is it possible to fine-tune a language model on an unlabelled dataset? After I train the model on my data, can I query it right away, or do I still need to fine-tune it on a specific task using an annotated dataset? Is there a minimum number of documents required to achieve good results? Is it possible to do this in a non-English language? Thank you.
Upvotes: 1
Views: 1064
Reputation: 156
You can fine-tune GPT-2 on an unlabelled dataset, but I can assure you the results won't be what you're looking for. You can ask it a question, but because that question doesn't really appear in your data, it may not answer in a very legible way. What you could do instead is prompt it with the first sentence of what you think the answer is and let GPT-2 fill in the rest, or try a few-shot prompt: give it several example question-and-answer pairs, then finish the prompt with the question you want answered. Mileage may vary.

Once you train the model on the unlabelled data you can query it, but if you can prepare a proper labelled dataset, I recommend training on that instead. I would suggest at least a couple hundred good examples (rows) and working your way up from there.

To the best of my knowledge, GPT-2 was trained on English only, but there are multilingual LLMs out there.
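To make the "fine-tune on raw text, then query" workflow concrete, here is a minimal sketch using Hugging Face `transformers` and `datasets`. The file name `my_documents.txt`, the output directory, and the hyperparameters are placeholder assumptions, not recommendations:

```python
# Sketch: causal language-model fine-tuning of GPT-2 on unlabelled text.
# "my_documents.txt" (one passage per line) and the hyperparameters are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Load the raw documents as plain text.
dataset = load_dataset("text", data_files={"train": "my_documents.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives the standard next-token (causal LM) objective, i.e. no labels needed.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized, data_collator=collator)
trainer.train()
trainer.save_model("gpt2-finetuned")
tokenizer.save_pretrained("gpt2-finetuned")
```

And a sketch of the few-shot style of querying mentioned above; the Q/A pairs are illustrative placeholders you would replace with examples from your own domain:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2-finetuned")

# A few example Q/A pairs, then the real question; the model continues after "A:".
prompt = (
    "Q: What does document X describe?\n"
    "A: It describes topic Y.\n"
    "Q: <your question here>\n"
    "A:"
)
print(generator(prompt, max_new_tokens=50)[0]["generated_text"])
```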
Upvotes: 1