Reputation: 1148
I have a 3D tensor in which each channel is to be convolved with one single kernel. From a quick search, the fastest way to do this seemed to be a grouped convolution with the number of groups equal to the number of channels.
Here is a small reproducible example:
import torch
import torch.nn as nn
torch.manual_seed(0)
x = torch.rand(1, 3, 3, 3)
first = x[:, 0:1, ...]
second = x[:, 1:2, ...]
third = x[:, 2:3, ...]
kernel = nn.Conv2d(1, 1, 3)
conv = nn.Conv2d(3, 3, 3, groups=3)
conv.weight.data = kernel.weight.data.repeat(3, 1, 1, 1)
conv.bias.data = kernel.bias.data.repeat(3)
>>> conv(x)
tensor([[[[-1.0085]],

         [[-1.0068]],

         [[-1.0451]]]], grad_fn=<MkldnnConvolutionBackward>)
>>> kernel(first), kernel(second), kernel(third)
(tensor([[[[-1.0085]]]], grad_fn=<ThnnConv2DBackward>),
 tensor([[[[-1.0068]]]], grad_fn=<ThnnConv2DBackward>),
 tensor([[[[-1.0451]]]], grad_fn=<ThnnConv2DBackward>))
As you can see, this works perfectly.
Now, coming to my question: I need to do backprop on this (the kernel object). When doing so, each weight of conv gets its own update, but conv is really just kernel repeated 3 times. At the end I need only a single updated kernel. How do I do this?
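For example (a rough sketch continuing the setup above, assuming a plain SGD step), after one update the three copies of the kernel inside conv no longer match, since each group receives its own gradient:
# Hypothetical SGD step on the grouped conv from above: each group's weight
# gets its own gradient, so the three copies of the kernel drift apart.
opt = torch.optim.SGD(conv.parameters(), lr=0.1)
conv(x).sum().backward()
opt.step()
print(torch.equal(conv.weight[0], conv.weight[1]))  # False: the copies have diverged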
PS: I need to optimize for speed
Upvotes: 1
Views: 460
Reputation: 40708
To reply to your own answer: averaging the weights is not an accurate method. You can operate on the gradients by summing them (see below), but not on the weights.
For a given convolution layer, when using groups you can think of it as passing groups separate elements through the same kernel. As such, the gradient on that kernel is accumulated, not averaged: the resulting gradient is effectively the sum of the per-group gradients:
kernel.weight.grad = conv.weight.grad.sum(0, keepdim=True)
You can verify this with pen and paper: if you average the weights, you end up averaging the weights of the previous step together with the gradients of each kernel, which effectively averages the gradients instead of summing them. And even that correspondence only makes sense under a simple update scheme like θ_t = θ_{t-1} - lr * grad; it breaks down entirely for more advanced optimizers that don't rely solely on such a rule. Therefore, you should be working with the gradients directly, not with the resulting weights.
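That relationship is easy to check numerically (a small sketch reusing x, kernel and conv from the question, where conv's weights are kernel repeated three times):
# Reset any existing gradients on both modules.
conv.zero_grad()
kernel.zero_grad()

# The same scalar loss, computed once through the grouped conv and once
# by running the single kernel over each channel separately.
conv(x).mean().backward()
loss = (kernel(x[:, 0:1]).mean()
        + kernel(x[:, 1:2]).mean()
        + kernel(x[:, 2:3]).mean()) / 3
loss.backward()

# The single kernel's gradient matches the sum of the per-group gradients.
print(torch.allclose(kernel.weight.grad, conv.weight.grad.sum(0, keepdim=True)))  # True
print(torch.allclose(kernel.bias.grad, conv.bias.grad.sum(0, keepdim=True)))      # True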
One alternative way to solve this is to implement your own shared-kernel convolution module. This can be done in the following two steps:

1. In the nn.Module initializer, register a single weight (and bias) parameter, and build the grouped weight with Tensor.expand instead of Tensor.repeat (the latter makes a copy). You should not make copies: the per-group weights must remain references to the same underlying data, i.e. your single kernel (see the quick check after this list).
2. Apply the grouped convolution with the functional variant, torch.nn.functional.conv2d, which gives you the needed flexibility.

From there you can backpropagate at any time, and the gradient will accumulate on the single underlying weight (and bias) parameter.
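To see why expand is the right choice here (a side check, not part of the module itself): expand returns a view over the same storage, while repeat allocates a new copy.
w = torch.rand(1, 1, 3, 3)
print(w.expand(3, -1, -1, -1).data_ptr() == w.data_ptr())  # True: a view over the same storage
print(w.repeat(3, 1, 1, 1).data_ptr() == w.data_ptr())     # False: a fresh copy in new memory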
Let's see it in practice:
import torch.nn.functional as F

class SharedKernelConv2d(nn.Module):
    def __init__(self, kernel_size, groups, **kwargs):
        super().__init__()
        self.kwargs = kwargs
        self.groups = groups
        # The one and only kernel (and bias), shared by every group.
        self.weight = nn.Parameter(torch.rand(1, 1, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.rand(1))

    def forward(self, x):
        # expand provides per-group views of the single parameter, so the
        # gradient accumulates directly on self.weight and self.bias.
        return F.conv2d(x,
                        weight=self.weight.expand(self.groups, -1, -1, -1),
                        bias=self.bias.expand(self.groups),
                        groups=self.groups,
                        **self.kwargs)
This is a very simple implementation, yet it is effective. Let's compare the two:
>>> sharedconv = SharedKernelConv2d(3, groups=3)
With the other method:
>>> conv = nn.Conv2d(3, 3, 3, groups=3)
>>> conv.weight.data = torch.clone(sharedconv.weight).repeat(3, 1, 1, 1)
>>> conv.bias.data = torch.clone(sharedconv.bias).repeat(3)
Backpropagate on the sharedconv layer:
>>> sharedconv(x).mean().backward()
>>> sharedconv.weight.grad
tensor([[[[0.7920, 0.6585, 0.8721],
          [0.6257, 0.3358, 0.6995],
          [0.5230, 0.6542, 0.3852]]]])
>>> sharedconv.bias.grad
tensor([1.])
Compared to summing the gradient on the repeated tensor:
>>> conv(x).mean().backward()
>>> conv.weight.grad.sum(0, keepdim=True)
tensor([[[[0.7920, 0.6585, 0.8721],
          [0.6257, 0.3358, 0.6995],
          [0.5230, 0.6542, 0.3852]]]])
>>> conv.bias.grad.sum(0, keepdim=True)
tensor([1.])
With SharedKernelConv2d you don't have to worry about updating the gradient with the sum of the kernel gradients every time. The accumulation happens automatically because self.weight and self.bias remain the single underlying parameters, referenced through Tensor.expand.
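For completeness, a minimal training sketch (illustrative only, with a random input and a plain SGD optimizer) showing that the optimizer only ever sees the single shared weight and bias:
sharedconv = SharedKernelConv2d(3, groups=3)
optimizer = torch.optim.SGD(sharedconv.parameters(), lr=0.1)

x = torch.rand(1, 3, 3, 3)
optimizer.zero_grad()
sharedconv(x).mean().backward()
optimizer.step()

# Only two parameters exist: the single kernel and the single bias.
print([p.shape for p in sharedconv.parameters()])
# [torch.Size([1, 1, 3, 3]), torch.Size([1])]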
Upvotes: 2
Reputation: 1148
One possible answer is to take the mean of the weights after the gradient update, like so:
kernel.weight.data = conv.weight.data.mean(0).unsqueeze(0)
Is this the best way to do it? Or is this even right in the first place?
Upvotes: 0