vpap

Reputation: 1547

RuntimeError: module must have its parameters and buffers on device cuda:1 (device_ids[0]) but found one of them on device: cuda:2

I have 4 GPUs (0,1,2,3) and I want to run one Jupyter notebook on GPU 2 and another one on GPU 0. Thus, after executing,

 export CUDA_VISIBLE_DEVICES=0,1,2,3

for the GPU 2 notebook I do,

device = torch.device( f'cuda:{2}' if torch.cuda.is_available() else 'cpu')
device, torch.cuda.device_count(), torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.get_device_properties(1)

and after creating a new model or loading one,

model = nn.DataParallel( model, device_ids = [ 0, 1, 2, 3])
model = model.to( device)

Then, when I start training the model, I get,

RuntimeError                              Traceback (most recent call last)
<ipython-input-18-849ffcb53e16> in <module>
 46             with torch.set_grad_enabled( phase == 'train'):
 47                 # [N, Nclass, H, W]
 ---> 48                 prediction = model(X)
 49                 # print( prediction.shape, y.shape)
 50                 loss_matrix = criterion( prediction, y)

~/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491             result = self._slow_forward(*input, **kwargs)
492         else:
--> 493             result = self.forward(*input, **kwargs)
494         for hook in self._forward_hooks.values():
495             hook_result = hook(self, input, result)

~/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
144                 raise RuntimeError("module must have its parameters and buffers "
145                                    "on device {} (device_ids[0]) but found one of "
--> 146                                    "them on device: {}".format(self.src_device_obj, t.device))
147 
148         inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2

Upvotes: 21

Views: 44093

Answers (5)

questionto42

Reputation: 9532

If the rest of the answers here do not help, and your training puts the parameters on a different GPU than the one you coded:
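
A minimal sketch for narrowing this down (an assumption, not this answer's original code; the DataParallel-wrapped model is assumed to be named model) is to print the device of every parameter and buffer:

# list the device of every parameter and buffer of the wrapped model
for name, p in model.named_parameters():
    print(name, p.device)
for name, b in model.named_buffers():
    print(name, b.device)

Every entry should report the device you passed first in device_ids; any entry on a different device is the one the error message is complaining about.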

Upvotes: 0

Ali Ganjbakhsh

Reputation: 801

This error happens when the model and the data are not both on CUDA.

Try something like this to put the model and the data on CUDA:

model = model.to('cuda')
images = images.to('cuda')

Upvotes: 0

Cyrus

Reputation: 483

model = nn.DataParallel(model, device_ids = [ 0, 1, 2, 3]).cuda()

works like magic for me.

BTW, according to the doc: input_var can be on any device, including CPU.
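
A rough illustration of that note (a sketch only; the torch import and the batch shape are assumptions, and model is the DataParallel-wrapped model from above):

import torch

# the wrapped model lives on the GPUs, but the input itself can stay on the CPU;
# DataParallel scatters it across the devices in device_ids for you
x = torch.randn(8, 3, 224, 224)   # CPU tensor, shape is only illustrative
y = model(x)                      # output is gathered back on device_ids[0]
print(y.device)                   # cuda:0 with the wrapping above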

Upvotes: 0

Coddy

Reputation: 566

For me even the following works:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    network = nn.DataParallel(network)

network.to(device)
tnsr = tnsr.to(device)

Upvotes: -1

jodag

Reputation: 22204

DataParallel requires every input tensor to be provided on the first device in its device_ids list.

It basically uses that device as a staging area before scattering to the other GPUs, and it is the device where the final outputs are gathered before returning from forward. If you want device 2 to be the primary device, you just need to put it at the front of the list as follows:

model = nn.DataParallel(model, device_ids = [2, 0, 1, 3])
model.to(f'cuda:{model.device_ids[0]}')

After that, all tensors provided to model should be on the first device as well.

x = ... # input tensor
x = x.to(f'cuda:{model.device_ids[0]}')
y = model(x)
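
Putting it together, a minimal sketch of one training step under this setup (criterion and optimizer are assumed to exist, as in the question's training loop; the reduction comment is an assumption):

primary = f'cuda:{model.device_ids[0]}'   # cuda:2 with the device_ids above

X = X.to(primary)                         # inputs go to device_ids[0], as described above
y = y.to(primary)                         # targets too, so the loss lives on the same device

prediction = model(X)                     # scattered to all GPUs, gathered back on cuda:2
loss = criterion(prediction, y).mean()    # reduce here if the criterion returns per-element losses
loss.backward()
optimizer.step()
optimizer.zero_grad()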

Upvotes: 35
