Reputation: 173
I'm encountering a challenging issue with GPU memory not being released properly between successive training phases in PyTorch, leading to CUDA out of memory errors.
My project involves fine-tuning a model in two consecutive phases: a first phase ("fp") followed by a supervised fine-tuning ("sft") phase.
The code structure is as follows:
import gc

import torch
from transformers import AutoModelForCausalLM
# Model and data loader initialization
model = AutoModelForCausalLM.from_pretrained(args.model) # fig.1
fp_data_loader = data_loader(fp_dataset)
sft_data_loader = data_loader(sft_dataset)
# First phase of training
fp_model, fp_loss = train_loop(model, fp_data_loader)  # fig.2
fp_model.module.save_pretrained(checkpoint_dir)
# Attempt to release GPU memory
del model, fp_data_loader
# Variant also tried: delete the trained model and reload it from the checkpoint
# del fp_model
# fp_model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
gc.collect()
torch.cuda.empty_cache()  # fig.3
# Second phase of training
sft_model, sft_loss = train_loop(fp_model, sft_data_loader)
(fig.3: nvidia-smi after empty_cache; due to a capture issue GPU 0 shows 77762/81920 MiB, the actual reading is 57524/81920 MiB.)
My expectation was that after empty_cache the GPU allocation would drop back to roughly what it was at fig.1 (right after loading the model), but quite a lot of GPU memory remains allocated, as shown in fig.3.
Despite explicitly deleting the model and data loader used in the first phase and calling gc.collect() and torch.cuda.empty_cache(), the GPU memory does not seem to be fully released. As a result, when initiating the second training phase, I'm faced with a CUDA out of memory error.
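For reference, here is a minimal diagnostic sketch (not part of the script above) that I can run right after empty_cache() to see whether the remaining memory is held by live tensors or merely cached by PyTorch's allocator:

import torch

# Memory occupied by live tensors on the current device
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated by tensors")
# Memory reserved by the caching allocator (this is what empty_cache() returns to the driver)
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")
# Abbreviated per-device breakdown
print(torch.cuda.memory_summary(abbreviated=True))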
I also wrap my model, optimizer, and data loaders with accelerator.prepare() for mixed precision and distributed training, which might be relevant.
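Since Accelerate keeps its own references to the prepared objects, one cleanup variant I'm considering looks roughly like this. This is only a sketch: it assumes the accelerator instance, the optimizer, and the prepared fp_model are accessible at this point (in my actual code the optimizer lives inside train_loop), and it uses Accelerator.free_memory() to drop Accelerate's internal references:

import gc

import torch

# Drop Accelerate's internal references to the prepared model/optimizer/dataloaders
accelerator.free_memory()
# Then drop my own references and flush the caching allocator
del fp_model, optimizer, fp_data_loader
gc.collect()
torch.cuda.empty_cache()

After this I would reload the phase-one weights from checkpoint_dir for the second phase, as in the commented-out line in the code above.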
Has anyone faced a similar issue, or does anyone have suggestions for ensuring that GPU memory is properly released between training phases? I've considered completely restarting the process between phases (sketched below) but would prefer a cleaner solution if possible.
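The process-restart fallback would look roughly like this (a sketch; phase_train.py is a hypothetical script that runs a single phase, saves a checkpoint, and exits so the OS reclaims all GPU memory between phases):

import subprocess

# Each phase runs in its own Python process; all CUDA memory is freed when it exits.
subprocess.run(["python", "phase_train.py", "--phase", "fp"], check=True)
subprocess.run(["python", "phase_train.py", "--phase", "sft"], check=True)

With distributed training this would presumably use accelerate launch instead of plain python.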
PS: This question was machine translated from Korean, apologies if there is awkward language.
Upvotes: 2
Views: 354
Reputation: 1
Scheduling successive workloads in multi-GPU environments can be quite a pain, especially for workloads with multi-phase dependencies. If you are still encountering challenges around it, Run:ai is an alternative that helps companies manage workloads efficiently and schedule them automatically on clusters. We can chat on LinkedIn if you think it might be useful.
Upvotes: -3