Reputation: 1
Question about PyTorch distributed training. With A6000 GPUs (48 GB memory) and 2 GPUs, training is normal. With 4090 GPUs (24 GB memory) and 2 GPUs, training is also normal. But with 4090s and 4 GPUs, training fails with "sending process xxx closing signal". Also, when using the 4090s, training consumes a lot of host RAM regardless of how many GPUs are used.
My model uses CLIP to extract features (CLIP is frozen; it is loaded on the CPU first and then moved to the device). I have already set batch size = 1.
I have tried setting find_unused_parameters=True and decreasing the batch size to 1, but neither works.
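For context, here is a minimal sketch of my setup (the linear head, feature dimension, and dummy data are placeholders for my actual model; the real code loads CLIP the same way, on CPU first and then moved to the device):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each spawned process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # Load CLIP on CPU first, then move it to the GPU; freeze all its weights
    clip_model, _ = clip.load("ViT-B/32", device="cpu")
    clip_model = clip_model.to(device).eval()
    for p in clip_model.parameters():
        p.requires_grad_(False)

    # Trainable head (placeholder); only this part is wrapped in DDP,
    # since the frozen CLIP encoder has no trainable parameters
    head = nn.Linear(512, 10).to(device)
    head = DDP(head, device_ids=[local_rank], find_unused_parameters=True)

    optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

    # Dummy batch with batch size = 1, as in my runs
    images = torch.randn(1, 3, 224, 224, device=device)
    labels = torch.zeros(1, dtype=torch.long, device=device)

    with torch.no_grad():  # frozen feature extractor
        feats = clip_model.encode_image(images).float()
    loss = nn.functional.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

I launch it with torchrun --nproc_per_node=4 train.py; that is the configuration where the "closing signal" error appears.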
Upvotes: 0
Views: 99