Zorlel

Reputation: 1

Huggingface Trainer with 2 GPUs doesn't train

The Hugging Face forums are kinda dead, so I'm trying here instead. Training my model on a single A100 80GB works fine without any problems. The model is pretty large, but with PEFT it fits in memory on one GPU.

When I try to train the model without PEFT, the memory is no longer enough: even with a low batch size and gradient accumulation I get "CUDA out of memory". So I added a second GPU and thought I could just run the trainer like before, but trainer.train() (and every other trainer function) never returns.

It just loops endlessly without returning anything.
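For a rough sense of why full fine-tuning stops fitting once PEFT is removed, here is a back-of-the-envelope estimate. It assumes fp32 weights and gradients plus the two fp32 moment buffers that AdamW keeps per parameter, and it ignores activations entirely. The parameter counts are hypothetical, since the actual model size is not stated above.

```python
# Rough lower bound on training memory for full fine-tuning with AdamW.
# Assumptions (not from the original post): everything is kept in fp32,
# activations and framework overhead are ignored.
BYTES_WEIGHTS = 4    # fp32 parameter
BYTES_GRADS = 4      # fp32 gradient
BYTES_ADAMW = 8      # two fp32 moment estimates per parameter


def full_finetune_bytes(num_params: int) -> int:
    """Bytes needed for weights, gradients and AdamW state."""
    return num_params * (BYTES_WEIGHTS + BYTES_GRADS + BYTES_ADAMW)


def fits_on_gpu(num_params: int, gpu_gb: float = 80.0) -> bool:
    """True if the estimate fits in gpu_gb gigabytes (1 GB = 1e9 bytes)."""
    return full_finetune_bytes(num_params) / 1e9 <= gpu_gb


if __name__ == "__main__":
    # Hypothetical model sizes, in billions of parameters.
    for billions in (1, 3, 7, 13):
        n = billions * 1_000_000_000
        gb = full_finetune_bytes(n) / 1e9
        print(f"{billions}B params: ~{gb:.0f} GB, fits on one 80 GB GPU: {fits_on_gpu(n)}")
```

Under these assumptions the cost is about 16 bytes per parameter before any activations, which is why lowering the batch size alone may not help.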

If it helps, these are my TrainingArguments():

training_args = TrainingArguments(
    output_dir="~/{}".format(peft_method),
    logging_dir="~/logs/{}".format(peft_method),
    learning_rate=3e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=logging_steps,
    optim="adamw_torch",
    save_total_limit=1,
)

and my Trainer():

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_data["train"],
    eval_dataset=processed_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[callback],
)

My device is set to "cuda:0" and the model is placed on that GPU correctly. I also tried putting it on the other GPU, but that returned nothing either.

Thank you for your help!

Upvotes: -1

Views: 1240

Answers (1)

TkrA

Reputation: 668

I assume you are using QLoRA + PEFT.

Make sure you pass device_map="auto" when you create your model. The Transformers Trainer will take care of the rest.

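from transformers import AutoModelForCausalLM

# model_path and bnb_config (a BitsAndBytesConfig) are assumed to be defined earlier
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    use_safetensors=True,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",  # spread the model's layers over the visible GPUs
)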
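To give an intuition for what splitting a model across devices means, here is a toy sketch that places layers on GPUs one after another until each budget is full. It only illustrates the idea and is not the algorithm the library actually uses; the function name, layer sizes and budgets are all made up for the example.

```python
from typing import Dict, List


def assign_layers(layer_sizes_gb: List[float], gpu_budgets_gb: List[float]) -> Dict[int, int]:
    """Toy greedy placement: fill GPU 0, then spill over to GPU 1, and so on.

    Returns a mapping from layer index to GPU index. Raises MemoryError if
    the layers do not fit into the given budgets in order.
    """
    placement: Dict[int, int] = {}
    gpu = 0
    used = 0.0
    for idx, size in enumerate(layer_sizes_gb):
        # Move on to the next GPU while the current one cannot hold this layer.
        while gpu < len(gpu_budgets_gb) and used + size > gpu_budgets_gb[gpu]:
            gpu += 1
            used = 0.0
        if gpu == len(gpu_budgets_gb):
            raise MemoryError(f"layer {idx} does not fit on any remaining GPU")
        placement[idx] = gpu
        used += size
    return placement


if __name__ == "__main__":
    # Hypothetical: four 30 GB blocks, two GPUs with 70 GB usable each.
    print(assign_layers([30, 30, 30, 30], [70, 70]))
```

With a split like this, each GPU holds only part of the model, which is different from data parallelism where every GPU keeps a full copy.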

Upvotes: 1
