Reputation: 747
I have a large file (1 GB+) with a mix of short and long texts (format: wikitext-2) for fine-tuning a masked language model, with bert-large-uncased as the baseline model. I followed the instructions at https://github.com/huggingface/transformers/tree/master/examples/language-modeling. The process seems to be stuck at the stage "Creating features from dataset file at <file loc>". I am unsure what is wrong: is it really stuck, or does it just take very long for a file of this size?
The command looks pretty much like this:
export TRAIN_FILE=/path/to/dataset/my.train.raw
export TEST_FILE=/path/to/dataset/my.test.raw
python run_language_modeling.py \
--output_dir=local_output_dir \
--model_type=bert \
--model_name_or_path=local_bert_dir \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
Added: The job is running on CPU.
Upvotes: 0
Views: 992
Reputation: 278
Since the file is huge, I would strongly recommend trying your code on a toy dataset before running it on your actual large data. This will also help when you debug.
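For example, you can carve out a toy subset by taking the first few thousand lines of the training file. A minimal sketch in Python (the paths and the 10,000-line cutoff are placeholders, adjust them to your setup):

# Write the first 10,000 lines of the raw training file to a toy file.
with open("/path/to/dataset/my.train.raw", encoding="utf-8") as src, \
        open("/path/to/dataset/toy.train.raw", "w", encoding="utf-8") as dst:
    for _, line in zip(range(10_000), src):
        dst.write(line)

If the "Creating features from dataset file" step finishes quickly on the toy file, your full run is most likely just slow rather than stuck, and the toy run gives you a rough basis for estimating the total time.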
If your system has multiple cores, consider a multi-processing strategy for the tokenization step (a sketch follows), or take a look at https://github.com/PyTorchLightning/pytorch-lightning for parallelizing the training itself.
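As far as I can tell, the feature-creation step in run_language_modeling.py tokenizes in a single process, which is why a 1 GB+ file takes so long. One workaround is to pre-tokenize the corpus in parallel with the Hugging Face datasets library; a minimal sketch, under the assumption that you then adapt the script to consume the pre-tokenized output (num_proc=4 and the paths are placeholders):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
raw = load_dataset("text", data_files={"train": "/path/to/dataset/my.train.raw"})

def tokenize(batch):
    # Truncate each line to BERT's maximum sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

# num_proc splits the work across worker processes; the result is cached
# on disk, so a re-run skips this step entirely.
tokenized = raw.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
tokenized.save_to_disk("/path/to/dataset/tokenized")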
Upvotes: 1