Rish

Reputation: 569

How to do language model training on BERT

I want to train BERT on a target corpus. I am looking at this HuggingFace implementation. They are using .raw files for the training data. If I have .txt files of my training data, how can I use their implementation?

Upvotes: 0

Views: 433

Answers (1)

Michael Jungo

Reputation: 32992

The .raw extension only indicates that they use the raw version of WikiText; the files themselves are regular text files containing the raw text:

We're using the raw WikiText-2 (no tokens were replaced before the tokenization).

The description of the data file options also says that they are text files. From run_language_modeling.py - L86-L88:

train_data_file: Optional[str] = field(
    default=None, metadata={"help": "The input training data file (a text file)."}
)

Therefore, you can just pass your .txt files as the training data file.
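Since train_data_file takes a single path, one way to handle several .txt files is to merge them into one file first. The following is only a rough sketch: merge_text_files and build_command are hypothetical helpers, not part of the HuggingFace script, and apart from --train_data_file (shown in the snippet above) the flag names and the bert-base-uncased checkpoint are assumptions based on how the example was documented at the time, so check them against your copy of the script.

```python
from pathlib import Path


def merge_text_files(input_dir, output_file):
    """Concatenate every .txt file in input_dir into one plain-text file.

    Hypothetical helper: returns the number of files that were merged.
    """
    output_path = Path(output_file).resolve()
    # Skip the output file itself in case it lives in the same directory.
    paths = [
        p for p in sorted(Path(input_dir).glob("*.txt"))
        if p.resolve() != output_path
    ]
    with open(output_file, "w", encoding="utf-8") as out:
        for path in paths:
            text = path.read_text(encoding="utf-8").strip()
            if text:
                out.write(text + "\n")
    return len(paths)


def build_command(train_file, output_dir):
    """Build the argument list for the example script.

    Only --train_data_file is confirmed by the snippet above; the other
    flags and the checkpoint name are assumptions to verify locally.
    """
    return [
        "python", "run_language_modeling.py",
        "--model_type=bert",
        "--model_name_or_path=bert-base-uncased",
        "--do_train",
        "--mlm",  # assumed flag: BERT is trained with masked language modeling
        f"--train_data_file={train_file}",
        f"--output_dir={output_dir}",
    ]
```

You could then call merge_text_files("my_corpus", "train.txt") and run the resulting command with subprocess.run(build_command("train.txt", "output"), check=True). The file extension does not matter to the script, so .txt works as well as .raw.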

Upvotes: 1
