Reputation: 173
I'm encountering a challenging issue with GPU memory not being released properly between successive training phases in PyTorch, leading to CUDA out of memory errors.
My project involves fine-tuning a model in two consecutive phases: a first phase ("fp") followed by a supervised fine-tuning ("sft") phase.
The code structure is as follows:
import gc

import torch
from transformers import AutoModelForCausalLM
# Model and data loader initialization
model = AutoModelForCausalLM.from_pretrained(args.model) # fig.1
fp_data_loader = data_loader(fp_dataset)
sft_data_loader = data_loader(sft_dataset)
# First phase of training
fp_model, fp_loss = train_loop(model, fp_data_loader)  # fig.2
fp_model.module.save_pretrained(checkpoint_dir)
# Attempt to release GPU memory
del model, fp_data_loader
# Variant also tried: delete the trained model and reload it from the checkpoint
# del fp_model
# fp_model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
gc.collect()
torch.cuda.empty_cache()  # fig.3
# Second phase of training
sft_model, sft_loss = train_loop(fp_model, sft_data_loader)
(fig.3: nvidia-smi after empty_cache; due to a capture issue GPU 0 shows 77762/81920 MiB, the actual reading is 57524/81920 MiB.)
My expectation was that after empty_cache the GPU allocation would drop back to roughly what it was at fig.1 (right after loading the model), but quite a lot of GPU memory remains allocated, as shown in fig.3.
Despite explicitly deleting the model and data loader used in the first phase and calling gc.collect() and torch.cuda.empty_cache(), the GPU memory does not seem to be fully released. As a result, when initiating the second training phase, I'm faced with a CUDA out of memory error.
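For reference, here is a minimal diagnostic sketch (not part of the script above) that I can run right after empty_cache() to see whether the remaining memory is held by live tensors or merely cached by PyTorch's allocator:

import torch

# Memory occupied by live tensors on the current device
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated by tensors")
# Memory reserved by the caching allocator (this is what empty_cache() returns to the driver)
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")
# Abbreviated per-device breakdown
print(torch.cuda.memory_summary(abbreviated=True))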
I also wrap my model, optimizer, and data loaders with accelerator.prepare() for mixed precision and distributed training, which might be relevant.
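Since Accelerate keeps its own references to the prepared objects, one cleanup variant I'm considering looks roughly like this. This is only a sketch: it assumes the accelerator instance, the optimizer, and the prepared fp_model are accessible at this point (in my actual code the optimizer lives inside train_loop), and it uses Accelerator.free_memory() to drop Accelerate's internal references:

import gc

import torch

# Drop Accelerate's internal references to the prepared model/optimizer/dataloaders
accelerator.free_memory()
# Then drop my own references and flush the caching allocator
del fp_model, optimizer, fp_data_loader
gc.collect()
torch.cuda.empty_cache()

After this I would reload the phase-one weights from checkpoint_dir for the second phase, as in the commented-out line in the code above.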
Has anyone faced a similar issue, or does anyone have suggestions for ensuring that GPU memory is properly released between training phases? I've considered completely restarting the process between phases (sketched below) but would prefer a cleaner solution if possible.
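The process-restart fallback would look roughly like this (a sketch; phase_train.py is a hypothetical script that runs a single phase, saves a checkpoint, and exits so the OS reclaims all GPU memory between phases):

import subprocess

# Each phase runs in its own Python process; all CUDA memory is freed when it exits.
subprocess.run(["python", "phase_train.py", "--phase", "fp"], check=True)
subprocess.run(["python", "phase_train.py", "--phase", "sft"], check=True)

With distributed training this would presumably use accelerate launch instead of plain python.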
PS: This question was machine translated from Korean, apologies if there is awkward language.
Upvotes: 2
Views: 354
Reputation: 1
Scheduling successive workloads in multi-GPU environments can be quite a pain, especially for workloads with multi-phase dependencies. If you are still encountering challenges around it, Run:ai is an alternative that helps companies manage workloads efficiently and schedule them automatically on clusters. We can chat on LinkedIn if you think it might be useful.
Upvotes: -3