kirstain.yuval

Reputation: 23

How can I find the root cause of a CUDA out-of-memory error in the middle of training?

I'm running RoBERTa with Hugging Face's language_modeling.py. After about 400 steps I suddenly get a CUDA out-of-memory error, and I don't know how to deal with it. Can you please help? Thanks

Upvotes: 1

Views: 1594

Answers (2)

kirstain.yuval

Reputation: 23

My problem was that I didn't compare the size of my GPU memory with the sizes of my samples. I had many fairly small samples and, after a lot of iterations, a large one. My bad. Thank you, and remember to check these things if it happens to you too.
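For anyone hitting the same thing: one way to guard against a single oversized sample blowing past GPU memory late in an epoch is to cap sample length before batching. A minimal sketch (the `filter_long_samples` helper and the 512-token limit are illustrative, not from the Hugging Face script):

```python
MAX_TOKENS = 512  # assumed cap; choose based on your GPU memory and model


def filter_long_samples(samples, max_tokens=MAX_TOKENS):
    """Truncate samples whose token count exceeds the limit, so one
    rare long sample cannot cause an out-of-memory error mid-training."""
    capped = []
    for tokens in samples:
        if len(tokens) <= max_tokens:
            capped.append(tokens)
        else:
            # Truncate instead of dropping; dropping would also work if
            # losing the sample is acceptable.
            capped.append(tokens[:max_tokens])
    return capped
```

Logging the maximum sample length in your dataset up front makes the same check possible without touching the training loop.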

Upvotes: 0

luk_dev

Reputation: 154

This can have multiple causes. If you only get the error after a number of iterations, it might be that you aren't freeing the computational graphs. Do you use loss.backward(retain_graph=True) or something similar?
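A related, very common leak is accumulating the loss tensor itself across steps, which keeps every step's graph alive. A minimal sketch of a loop that does not leak (the tiny model and data here are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(100):
    x = torch.randn(8, 10)
    y = torch.randn(8, 1)
    loss = ((model(x) - y) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()   # no retain_graph=True: the graph is freed here
    optimizer.step()

    # .item() extracts a plain float; writing `running_loss += loss`
    # instead would retain every step's graph and grow memory each step.
    running_loss += loss.item()
```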

Also, when you're running inference, be sure to use

with torch.no_grad():
    model(...)

Otherwise the computational graphs are saved there as well, and they are potentially never freed, since you never call backward() on them.
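To actually track down where the memory goes, you can log CUDA memory usage at a few points in the training loop and see whether it grows step over step. A small sketch (the `report_cuda_memory` helper is an illustrative name, not part of any library):

```python
import torch


def report_cuda_memory(tag=""):
    # Print current and peak CUDA memory in MiB; returns (0, 0) on CPU-only
    # machines so the helper is safe to call anywhere.
    if not torch.cuda.is_available():
        print(f"{tag}: no CUDA device")
        return 0.0, 0.0
    allocated = torch.cuda.memory_allocated() / 1024**2      # MiB held by tensors
    peak = torch.cuda.max_memory_allocated() / 1024**2       # peak MiB so far
    print(f"{tag}: {allocated:.0f} MiB allocated, peak {peak:.0f} MiB")
    return allocated, peak
```

Calling it every N steps (e.g. `report_cuda_memory(f"step {step}")`) makes a slow leak, or a sudden spike from one large batch, immediately visible in the logs.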

Upvotes: 2
