nlp4892

Reputation: 89

Huggingface trainer leaves residual memory

I am currently trying to use the Hugging Face Trainer in a for-loop-like setting: for each example in my dataset, I train on that single example and then evaluate, so I initialize a Trainer and call trainer.train() multiple times in my script. I am using Trainer because of its easy DeepSpeed integration, which I need to fit a larger model on my GPU.

Right after calling trainer.train(), memory usage on my GPU spikes to ~27GB and stays there permanently, so in later iterations of the loop this memory is still allocated and combines with the next Trainer's memory to cause an OOM error. I have tried deleting the trainer, its optimizer, its model, and the model itself, and I call torch.cuda.empty_cache() and gc.collect() often in my code. For example, here is the code at the end of each loop iteration:

    del model
    torch.cuda.empty_cache()
    gc.collect()

However, I still have not managed to locate what is causing this residual memory. 27GB is about how much the full model should take to load on the GPU, but deleting it does not change anything. Is there any way I can fix this? I suspect del model only drops the Python reference and does not actually free the GPU memory, but I'm not quite sure how to make sure everything is deleted properly before the next iteration of the loop.
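To check whether the memory is actually being released, one thing I plan to try is printing the allocated and reserved CUDA memory around the cleanup (a rough sketch; it only relies on torch.cuda.memory_allocated and torch.cuda.memory_reserved):

    import gc
    import torch

    def report_gpu_memory(tag):
        # memory currently held by live tensors
        allocated_gb = torch.cuda.memory_allocated() / 1024**3
        # memory kept by PyTorch's caching allocator (can be larger than allocated)
        reserved_gb = torch.cuda.memory_reserved() / 1024**3
        print(f"[{tag}] allocated={allocated_gb:.2f} GB, reserved={reserved_gb:.2f} GB")

    # at the end of each loop iteration
    report_gpu_memory("before cleanup")
    del model
    gc.collect()
    torch.cuda.empty_cache()
    report_gpu_memory("after cleanup")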

Would very much appreciate help with this.

Example Code:

    tokenizer = LlamaTokenizer.from_pretrained(llama_path)
    output = []
    for example in dataset:
        model = LlamaForCausalLM.from_pretrained(llama_path)  # download Llama-7B
        train_dataset = CustomDataset(example, ...)
        trainer = Trainer(model=model,
                          args=training_args,  # includes path to the deepspeed config
                          train_dataset=train_dataset,
                          tokenizer=tokenizer)
        trainer.train()
        # *** evaluate model ***
        logits = model.logits(example)
        output.append(logits)  # goal is to get these logits for each example

Upvotes: 2

Views: 1479

Answers (1)

Omel

Reputation: 11

Yes, I had the same issue. In my case, what worked was deleting the trainer itself:

    del trainer

and that solved it.
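For reference, the cleanup at the end of each loop iteration would then look roughly like this (a sketch based on the cleanup code in the question; adding del trainer was the part that made the difference for me):

    # end of each loop iteration
    del trainer               # drops the Trainer along with its optimizer and scheduler references
    del model                 # drop the model reference as well
    gc.collect()              # let Python actually collect the now-unreferenced objects
    torch.cuda.empty_cache()  # release cached blocks held by PyTorch's allocator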

You can also check which other variable might be the culprit by listing the in-scope names with:

    dir()

Go through the list and see if anything suspicious is still holding a reference.
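If dir() is too coarse, a small loop over the objects tracked by the garbage collector can show which tensors are still sitting on the GPU (a rough sketch; it only uses gc.get_objects() and checks each tensor's device):

    import gc
    import torch

    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                # print every tensor still living on the GPU
                print(type(obj), obj.size(), obj.dtype)
        except Exception:
            # some tracked objects raise on attribute access; skip them
            pass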

Upvotes: 0
