CARTman

Reputation: 747

Huggingface language modeling stuck at data reading phase

I have a large file (over 1 GB) containing a mix of short and long texts (format: wikitext-2) for fine-tuning a masked language model, with bert-large-uncased as the baseline model. I followed the instructions at https://github.com/huggingface/transformers/tree/master/examples/language-modeling. The process appears to be stuck at the stage "Creating features from dataset file at <file loc>". I am unsure what is wrong: is it really stuck, or does this step simply take a very long time for a file of this size?

The command looks roughly like this:

export TRAIN_FILE=/path/to/dataset/my.train.raw
export TEST_FILE=/path/to/dataset/my.test.raw

python run_language_modeling.py \
    --output_dir=local_output_dir \
    --model_type=bert \
    --model_name_or_path=local_bert_dir \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm

Added: the job is running on CPU.

Upvotes: 0

Views: 992

Answers (1)

user12769533

Reputation: 278

Since the file is huge, I would strongly recommend trying your code on a toy dataset before running it on the full data. That also makes debugging much easier.
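For example, a toy dataset can be carved out of the large training file by keeping only its first few hundred lines. A minimal sketch (the helper name and line count are arbitrary choices, not part of the Hugging Face scripts):

```python
from itertools import islice

def make_toy_dataset(src, dst, n_lines=1000):
    """Copy the first n_lines of src into dst to get a small smoke-test file."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(islice(fin, n_lines))
    return dst
```

You would then point `--train_data_file` at the resulting toy file; if the "Creating features" stage finishes quickly there, the original run is most likely just slow rather than hung.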

If your system has multiple cores, consider a multi-processing strategy for the data preparation. Take a look at https://github.com/PyTorchLightning/pytorch-lightning.
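As one possible approach, the per-line tokenization can be spread across worker processes with the standard library. This is only a sketch: `tokenize` below is a whitespace-splitting stand-in for the real BERT tokenizer, and the function names are illustrative, not part of any library:

```python
from multiprocessing import Pool

def tokenize(line):
    # Stand-in tokenizer for illustration. In a real run this would call the
    # BERT tokenizer from the transformers library instead of str.split.
    return line.split()

def tokenize_file(path, processes=4, chunksize=1000):
    """Tokenize a text file line by line across several worker processes."""
    with open(path, encoding="utf-8") as f:
        with Pool(processes=processes) as pool:
            # imap streams lines to the workers; chunksize amortizes the
            # inter-process communication overhead over batches of lines.
            return list(pool.imap(tokenize, f, chunksize=chunksize))
```

The Hugging Face `datasets` library offers a similar shortcut: `load_dataset("text", data_files=...)` followed by `.map(..., num_proc=N)` tokenizes in parallel and caches the result, so the expensive step runs only once.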

Upvotes: 1
