How to do language model training on BERT

Question

I want to train BERT on a target corpus. I am looking at this HuggingFace implementation. They are using .raw files for the training data. If I have .txt files of my training data, how can I use their implementation?

Michael Jungo · Accepted Answer

The .raw extension only indicates that they use the raw version of WikiText; the files themselves are regular text files containing the raw text:

We're using the raw WikiText-2 (no tokens were replaced before the tokenization).

The description of the data file options also says that they are text files. From run_language_modeling.py - L86-L88:

train_data_file: Optional[str] = field(
    default=None, metadata={"help": "The input training data file (a text file)."}
)

Therefore, you can just specify your .txt files as train_data_file; the file extension does not matter.
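
To illustrate how that field becomes a command-line option, here is a minimal sketch that uses only the standard library. It is an assumption-laden stand-in, not the script's actual parsing code (the real script relies on the HfArgumentParser class from transformers). The helper parse_data_args and the file name my_corpus.txt are hypothetical, and only the one field quoted above is reproduced.

```python
import argparse
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class DataTrainingArguments:
    # Only the field quoted above; the rest of the real class is omitted.
    train_data_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a text file)."}
    )


def parse_data_args(argv):
    """Hypothetical stdlib stand-in: expose each dataclass field as a --flag."""
    parser = argparse.ArgumentParser()
    for f in fields(DataTrainingArguments):
        parser.add_argument(
            f"--{f.name}", type=str, default=f.default, help=f.metadata.get("help")
        )
    namespace = parser.parse_args(argv)
    return DataTrainingArguments(**vars(namespace))


# "my_corpus.txt" is a made-up name; the value is just a path string,
# so .txt and .raw are treated exactly the same.
args = parse_data_args(["--train_data_file", "my_corpus.txt"])
print(args.train_data_file)
```

The value is passed through as a plain path, so nothing in the argument handling depends on the .raw extension.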
