Reputation: 21
My Model:
import torch
import torch.nn as nn
import numpy as np


class myNet(nn.Module):
    def __init__(self):
        super(myNet, self).__init__()
        self.act1 = Dynamic_relu_b(64)
        self.conv1 = nn.Conv2d(3, 64, 3)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, 20)  # conv1 outputs 64 channels, so 64 features after pooling

    def forward(self, x):
        x = self.conv1(x)
        x = self.act1(x)
        x = self.pool(x)
        x = x.view(x.shape[0], -1)
        x = self.fc(x)
        return x
Here is code that reproduces the problem:
def one_hot_smooth_label(x, num_class, smooth=0.1):
    num = x.shape[0]
    labels = torch.zeros((num, num_class))
    for i in range(num):
        labels[i][x[i]] = 1
    labels = (1 - (num_class - 1) / num_class * smooth) * labels + smooth / num_class
    return labels
images = torch.rand((4, 3, 300, 300))
images = images.cuda()
labels = torch.from_numpy(np.array([1, 0, 0, 1]))

model = myNet()
model = model.cuda()
output = model(images)

labels = one_hot_smooth_label(labels, 20)
labels = labels.cuda()

criterion = nn.BCEWithLogitsLoss()
loss = criterion(output, labels)
loss.backward()
The error:
RuntimeError Traceback (most recent call last)
<ipython-input-42-1268777e87e6> in <module>()
21
22 loss=criterion(output,labels)
---> 23 loss.backward()
1 frames
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
98 Variable._execution_engine.run_backward(
99 tensors, grad_tensors, retain_graph, create_graph,
--> 100 allow_unreachable=True) # allow_unreachable flag
101
102
RuntimeError: Function AddBackward0 returned an invalid gradient at index 1 - expected type TensorOptions(dtype=float, device=cpu, layout=Strided, requires_grad=false) but got TensorOptions(dtype=float, device=cuda:0, layout=Strided, requires_grad=false) (validate_outputs at /pytorch/torch/csrc/autograd/engine.cpp:484)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fcf7711b536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x2d84224 (0x7fcfb1bad224 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x548 (0x7fcfb1baed58 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fcfb1bb0ce2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fcfb1ba9359 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fcfbe2e8378 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xbd6df (0x7fcfe23416df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x76db (0x7fcfe34236db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x3f (0x7fcfe375c88f in /lib/x86_64-linux-gnu/libc.so.6)
After many experiments, I found that act1 in the model was the problem: if I delete act1, the error does not appear!
But I don't know why act1 causes this problem.
The part of the error that seems wrong is requires_grad=false, and I don't know which part set it.
This is the code for act1 (Dynamic_relu_b):
class Residual(nn.Module):
    def __init__(self, in_channel, R=8, k=2):
        super(Residual, self).__init__()
        self.avg = nn.AdaptiveAvgPool2d((1, 1))
        self.relu = nn.ReLU(inplace=True)
        self.R = R
        self.k = k
        out_channel = int(in_channel / R)
        self.fc1 = nn.Linear(in_channel, out_channel)
        fc_list = []
        for i in range(k):
            fc_list.append(nn.Linear(out_channel, 2 * in_channel))
        self.fc2 = nn.ModuleList(fc_list)

    def forward(self, x):
        x = self.avg(x)           # [bs, C, 1, 1]
        x = torch.squeeze(x)      # [bs, C]
        x = self.fc1(x)
        x = self.relu(x)
        result_list = []
        for i in range(self.k):
            result = self.fc2[i](x)                   # [bs, 2 * C]
            result = 2 * torch.sigmoid(result) - 1    # squash to (-1, 1)
            result_list.append(result)
        return result_list
class Dynamic_relu_b(nn.Module):
    def __init__(self, inchannel, R=8, k=2):
        super(Dynamic_relu_b, self).__init__()
        self.lambda_alpha = 1
        self.lambda_beta = 0.5
        self.R = R
        self.k = k
        self.init_alpha = torch.zeros(self.k)
        self.init_beta = torch.zeros(self.k)
        self.init_alpha[0] = 1
        self.init_beta[0] = 1
        for i in range(1, k):
            self.init_alpha[i] = 0
            self.init_beta[i] = 0
        self.residual = Residual(inchannel)

    def forward(self, input):
        delta = self.residual(input)   # list of k tensors, each [bs, 2 * C]
        in_channel = input.shape[1]
        bs = input.shape[0]
        alpha = torch.zeros((self.k, bs, in_channel))
        beta = torch.zeros((self.k, bs, in_channel))
        for i in range(self.k):
            # even columns of delta[i] hold the alphas, odd columns the betas
            for j, c in enumerate(range(0, in_channel * 2, 2)):
                alpha[i, :, j] = delta[i][:, c]
                beta[i, :, j] = delta[i][:, c + 1]
        alpha1 = alpha[0]
        beta1 = beta[0]
        max_result = self.dynamic_function(alpha1, beta1, input, 0)
        for i in range(1, self.k):
            alphai = alpha[i]
            betai = beta[i]
            result = self.dynamic_function(alphai, betai, input, i)
            max_result = torch.max(max_result, result)
        return max_result

    def dynamic_function(self, alpha, beta, x, k):
        init_alpha = self.init_alpha[k]
        init_beta = self.init_beta[k]
        alpha = init_alpha + self.lambda_alpha * alpha
        beta = init_beta + self.lambda_beta * beta
        bs = x.shape[0]
        channel = x.shape[1]
        results = torch.zeros_like(x)
        for i in range(bs):
            for c in range(channel):
                results[i, c, :, :] = x[i, c] * alpha[i, c] + beta[i, c]
        return results
How should I solve this problem?
Upvotes: 0
Views: 4673
Reputation: 32992
In PyTorch, two tensors need to be on the same device to perform any mathematical operation between them, but in your case one is on the CPU and the other on the GPU. The error is not as clear as it normally is, because it happened in the backward pass. You were (un)lucky that your forward pass did not fail: there is an exception to the same-device restriction when one operand is a scalar, e.g. tensor * 2, and it even applies when the scalar is a 0-dimensional tensor on a different device, e.g. cpu_tensor * tensor(2, device='cuda:0'). You are using a lot of loops and accessing individual scalars to calculate further results, which is why the forward pass slips through.
While the forward pass works like that, in the backward pass, when the gradients are calculated, they are multiplied with the upstream gradients (application of the chain rule). At that point, the two are on different devices and the error surfaces.
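A small standalone illustration of these rules (a sketch, not taken from your model; it assumes a CUDA device is available):

import torch

a_gpu = torch.ones(3, device='cuda')   # tensor on the GPU
b_cpu = torch.ones(3)                  # same-shaped tensor on the CPU

# a_gpu * b_cpu                        # raises a RuntimeError: operands are on different devices
a_gpu * b_cpu.to(a_gpu.device)         # works: both operands are on cuda:0

scalar_cpu = torch.tensor(2.0)         # 0-dimensional CPU tensor
a_gpu * scalar_cpu                     # works: 0-dim tensors are treated like Python scalars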
You have identified that it's in Dynamic_relu_b. In there you need to make sure that every tensor you create is on the same device as the input. The two tensors you create in the forward method are:
alpha = torch.zeros((self.k, bs, in_channel))
beta = torch.zeros((self.k, bs, in_channel))
These are created on the CPU, but your input is on the GPU, so you need to put them on the GPU as well. To be generic, they should be created on whatever device the input is located:
alpha = torch.zeros((self.k, bs, in_channel), device=input.device)
beta = torch.zeros((self.k, bs, in_channel), device=input.device)
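The same reasoning applies to tensors a module creates in __init__ (like init_alpha and init_beta): if you register them as buffers, model.cuda() or model.to(device) moves them together with the parameters. A minimal sketch of that pattern (my addition; the module name is made up purely for illustration):

import torch
import torch.nn as nn

class DeviceAwareModule(nn.Module):   # hypothetical module, just to show the pattern
    def __init__(self, k=2):
        super().__init__()
        init_alpha = torch.zeros(k)
        init_alpha[0] = 1
        # A buffer follows model.to(device)/model.cuda(), unlike a plain tensor attribute.
        self.register_buffer('init_alpha', init_alpha)

    def forward(self, x):
        # After model.cuda(), self.init_alpha lives on the same device as the parameters.
        return x * self.init_alpha[0]

model = DeviceAwareModule().cuda()        # assumes a CUDA device is available
out = model(torch.ones(3, device='cuda'))  # buffer and input are both on cuda:0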
The biggest problem in your code is the loops. Not only did they obfuscate a bug, they are also very harmful to performance, since they can neither be parallelised nor vectorised, and those are the reasons why GPUs are so fast. I'm certain that these loops can be replaced with more efficient operations, but you'll have to get out of the mindset of creating an empty tensor and then filling it one by one.
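For instance, the double loop in Dynamic_relu_b's forward that copies the even and odd columns of each delta[i] into alpha and beta can be written with strided slicing (a sketch of one possible rewrite, not something you have to copy verbatim); as a bonus, the result automatically lives on the same device as delta:

# delta is a list of k tensors, each of shape [bs, 2 * in_channel]
stacked = torch.stack(delta)    # shape [k, bs, 2 * in_channel], same device as delta
alpha = stacked[:, :, 0::2]     # even columns -> shape [k, bs, in_channel]
beta = stacked[:, :, 1::2]      # odd columns  -> shape [k, bs, in_channel]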
I'll give you one example, from dynamic_function:
results = torch.zeros_like(x)
for i in range(bs):
    for c in range(channel):
        results[i, c, :, :] = x[i, c] * alpha[i, c] + beta[i, c]
You're multiplying x (size: [bs, channel, height, width]) with alpha (size: [bs, channel]), where every plane (height, width) of x is multiplied by a different element of alpha (a scalar). That would be the same as doing an element-wise multiplication with a tensor of the same size as the plane [height, width], where all elements are that same scalar.
Thankfully, you don't need to repeat the values yourself, since singleton dimensions (dimensions with size 1) are automatically expanded to match the size of the other tensor; see PyTorch - Broadcasting Semantics for details. That means you only need to reshape alpha to have size [bs, channel, 1, 1].
The loop can therefore be replaced with:
results = x * alpha.view(bs, channel, 1, 1) + beta.view(bs, channel, 1, 1)
By eliminating that loop you gain a lot of performance, and your initial error also becomes much clearer, because the forward pass now fails with the following message:
File "main.py", line 78, in dynamic_function
results = x * alpha.view(bs, channel, 1, 1) + beta.view(bs, channel, 1, 1)
RuntimeError: expected device cuda:0 but got device cpu
Now you would know that one of these is on the CPU and the other on the GPU.
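As a quick sanity check of the broadcasted version (a small CPU-only sketch I'm adding so it runs without a GPU):

import torch

bs, channel, height, width = 4, 64, 7, 7
x = torch.rand(bs, channel, height, width)
alpha = torch.rand(bs, channel)
beta = torch.rand(bs, channel)

# original loop version
loop_results = torch.zeros_like(x)
for i in range(bs):
    for c in range(channel):
        loop_results[i, c, :, :] = x[i, c] * alpha[i, c] + beta[i, c]

# broadcasted version
broadcast_results = x * alpha.view(bs, channel, 1, 1) + beta.view(bs, channel, 1, 1)

print(torch.allclose(loop_results, broadcast_results))  # True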
Upvotes: 3