pumpkin

Reputation: 1

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9)

Question about PyTorch distributed training. With A6000s (48 GB memory), training on 2 GPUs is normal. With 4090s (24 GB memory), training on 2 GPUs is also normal, but when training on 4 GPUs the launcher reports "sending process xxx closing signal" and the job fails. With the 4090s, no matter how many GPUs are used, training consumes a large amount of host RAM (screenshot of RAM usage).
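For reference, the runs are launched with torchrun (or the equivalent torch.distributed launcher); the script name is a placeholder, not my exact command:

    torchrun --nproc_per_node=2 train.py   # 2-GPU runs: fine on A6000 and 4090
    torchrun --nproc_per_node=4 train.py   # 4-GPU run on the 4090 machine: fails with exitcode -9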

My model uses CLIP to extract features. CLIP is frozen; it is loaded on the CPU first and then moved to the device. I have set batch size = 1.
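Roughly, the feature-extraction part looks like this (the openai/CLIP package and the ViT-B/32 checkpoint are stand-ins for the exact CLIP implementation I use; LOCAL_RANK comes from the launcher):

    import os
    import torch
    import clip  # openai/CLIP; stand-in for the actual CLIP implementation used

    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # Load CLIP on the CPU first, then move it to the local GPU.
    clip_model, preprocess = clip.load("ViT-B/32", device="cpu")
    clip_model = clip_model.to(f"cuda:{local_rank}").eval()

    # CLIP is frozen: no gradients flow through it.
    for p in clip_model.parameters():
        p.requires_grad_(False)

    @torch.no_grad()
    def extract_features(images):
        # images: (1, 3, 224, 224) tensor on the same GPU (batch size = 1)
        return clip_model.encode_image(images)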

I have tried setting find_unused_parameters=True and decreasing the batch size to 1, but neither helps.
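The DDP setup is roughly the following (the linear head and the random TensorDataset are dummy stand-ins for my actual trainable model and dataset):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Dummy stand-ins for the real trainable head and dataset.
    model = torch.nn.Linear(512, 10).to(local_rank)
    train_dataset = TensorDataset(torch.randn(8, 512), torch.zeros(8, dtype=torch.long))

    model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)

    sampler = DistributedSampler(train_dataset)
    loader = DataLoader(train_dataset, batch_size=1, sampler=sampler,
                        num_workers=2, pin_memory=True)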

Upvotes: 0

Views: 99

Answers (0)
