Reputation: 747
I have a large file (1 GB+) with a mix of short and long texts (format: wikitext-2) for fine-tuning a masked language model, with bert-large-uncased as the baseline model. I followed the instructions at https://github.com/huggingface/transformers/tree/master/examples/language-modeling. The process seems to be stuck at the stage "Creating features from dataset file at <file loc>". I am unsure what is wrong: is it really stuck, or does it just take very long for a file of this size?
The command looks pretty much like this:
export TRAIN_FILE=/path/to/dataset/my.train.raw
export TEST_FILE=/path/to/dataset/my.test.raw
python run_language_modeling.py \
--output_dir=local_output_dir \
--model_type=bert \
--model_name_or_path=local_bert_dir \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
Added: The job is running on CPU.
Upvotes: 0
Views: 992
Reputation: 278
Since the file is huge, I would strongly recommend trying your code on a toy dataset before running it on your actual large data. This will also help when you debug.
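For example, you can carve out a toy subset by taking the first few thousand lines of the training file. A minimal sketch in Python (the paths and the 10,000-line cutoff are placeholders, adjust them to your setup):

# Write the first 10,000 lines of the raw training file to a toy file.
with open("/path/to/dataset/my.train.raw", encoding="utf-8") as src, \
        open("/path/to/dataset/toy.train.raw", "w", encoding="utf-8") as dst:
    for _, line in zip(range(10_000), src):
        dst.write(line)

If the "Creating features from dataset file" step finishes quickly on the toy file, your full run is most likely just slow rather than stuck, and the toy run gives you a rough basis for estimating the total time.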
If your system has multiple cores, consider a multi-processing strategy for the tokenization step (a sketch follows), or take a look at https://github.com/PyTorchLightning/pytorch-lightning for parallelizing the training itself.
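As far as I can tell, the feature-creation step in run_language_modeling.py tokenizes in a single process, which is why a 1 GB+ file takes so long. One workaround is to pre-tokenize the corpus in parallel with the Hugging Face datasets library; a minimal sketch, under the assumption that you then adapt the script to consume the pre-tokenized output (num_proc=4 and the paths are placeholders):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
raw = load_dataset("text", data_files={"train": "/path/to/dataset/my.train.raw"})

def tokenize(batch):
    # Truncate each line to BERT's maximum sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

# num_proc splits the work across worker processes; the result is cached
# on disk, so a re-run skips this step entirely.
tokenized = raw.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
tokenized.save_to_disk("/path/to/dataset/tokenized")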
Upvotes: 1