Reputation: 569
I want to train BERT on a target corpus. I am looking at this HuggingFace implementation. They are using .raw files for the training data. If I have .txt files of my training data, how can I use their implementation?
Upvotes: 0
Views: 433
Reputation: 32992
The .raw
only indicates that they use the raw version of the WikiText, they are regular text files containing the raw text:
We're using the raw WikiText-2 (no tokens were replaced before the tokenization).
The description of the data files options also says that they are text files. From run_language_modeling.py - L86-L88:
train_data_file: Optional[str] = field(
default=None, metadata={"help": "The input training data file (a text file)."}
)
Therefore you can just specify your text files.
Upvotes: 1