S P Sharan

Reputation: 1148

Backprop for Repeated Convolution using Grouped Convolution

I have a 3D tensor in which each channel is to be convolved with one single kernel. From a quick search, the fastest way to do this seems to be a grouped convolution with the number of groups equal to the number of channels.

Here is a small reproducible example:

import torch
import torch.nn as nn
torch.manual_seed(0)


x = torch.rand(1, 3, 3, 3)
first  = x[:, 0:1, ...]
second = x[:, 1:2, ...]
third  = x[:, 2:3, ...]

kernel = nn.Conv2d(1, 1, 3)            # the single shared kernel
conv = nn.Conv2d(3, 3, 3, groups=3)    # grouped conv: one group per input channel
conv.weight.data = kernel.weight.data.repeat(3, 1, 1, 1)   # copy the kernel into each group
conv.bias.data = kernel.bias.data.repeat(3)

>>> conv(x)
tensor([[[[-1.0085]],

         [[-1.0068]],

         [[-1.0451]]]], grad_fn=<MkldnnConvolutionBackward>)

>>> kernel(first), kernel(second), kernel(third)
(tensor([[[[-1.0085]]]], grad_fn=<ThnnConv2DBackward>),
 tensor([[[[-1.0068]]]], grad_fn=<ThnnConv2DBackward>),
 tensor([[[[-1.0451]]]], grad_fn=<ThnnConv2DBackward>))

As you can see, this works perfectly.
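For completeness, a quick sanity check (using the same x, kernel and conv as above) that the grouped convolution reproduces the per-channel results:

>>> stacked = torch.cat([kernel(first), kernel(second), kernel(third)], dim=1)
>>> torch.allclose(conv(x), stacked)
True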

Now to my question: I need to run backprop on this and update the kernel object. During the update, each weight of conv gets its own update, but conv is really just kernel repeated 3 times, and in the end I need only a single updated kernel. How do I do this?

PS: I need to optimize for speed

Upvotes: 1

Views: 460

Answers (2)

Ivan

Reputation: 40708

To reply to your own answer: averaging the weights is actually not an accurate method. You can operate on the gradients by summing them (see below), but not on the weights.


For a given convolution layer using groups, you can think of it as passing groups separate elements through the same kernel. As such, the gradient is accumulated, not averaged: the resulting gradient for the single kernel is effectively the sum of the per-group gradients:

kernel.weight.grad = conv.weight.grad.sum(0, keepdim=True)

You can verify this with pen and paper: if you average the updated weights, what you are really averaging is the previous step's weights together with each kernel copy's gradient, which is not the same as applying the summed gradient. And even that characterization only holds for a simple update rule like θ_t = θ_{t-1} - lr*grad; more advanced optimizers don't rely solely on such a scheme, so the mismatch gets worse. Therefore, you should be working with the gradients directly, not with the resulting weights.
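To make this concrete with the simple update rule above, here is a minimal numeric sketch (theta, g and lr below are arbitrary values, purely for illustration):

import torch
torch.manual_seed(0)

lr = 0.1
theta = torch.rand(3, 3)   # the single shared kernel before the update
g = torch.rand(3, 3, 3)    # hypothetical per-copy gradients g[0], g[1], g[2] of the repeated kernel

# correct shared-parameter step: the gradient of a shared weight is the *sum* of its copies' gradients
shared_step = theta - lr * g.sum(0)

# "average the updated copies" scheme: update each copy separately, then average the results
averaged_step = (theta.unsqueeze(0) - lr * g).mean(0)   # equals theta - lr * g.mean(0)

# the averaged update is smaller by a factor of the number of copies (3 here);
# with momentum or Adam the discrepancy is not even a simple scaling anymore
print((shared_step - theta) / (averaged_step - theta))  # ≈ 3 everywhere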

One alternative way you can solve this is by implementing your own shared-kernel convolution module. This can be done in the following two steps:

  • Define your single kernel in the nn.Module initializer.
  • In the forward definition, make a view of your kernel to match the number of groups. Use Tensor.expand instead of Tensor.repeat (the latter makes a copy). You should not make copies; they must remain references to the same underlying data, i.e. your single kernel. Then apply the grouped convolution with more flexibility using the functional variant, torch.nn.functional.conv2d.

From there you can backpropagate anytime, and the gradient will accumulate on the single underlying weight (and bias) parameter.

Let's see it in practice:

import torch.nn.functional as F

class SharedKernelConv2d(nn.Module):
   def __init__(self, kernel_size, groups, **kwargs):
      super().__init__()
      self.kwargs = kwargs
      self.groups = groups
      # a single kernel (and bias) shared by every group
      self.weight = nn.Parameter(torch.rand(1, 1, kernel_size, kernel_size))
      self.bias = nn.Parameter(torch.rand(1))

   def forward(self, x):
      # expand (not repeat) the kernel across groups: no copies, just views of the same parameter
      return F.conv2d(x,
         weight=self.weight.expand(self.groups, -1, -1, -1),
         bias=self.bias.expand(self.groups),
         groups=self.groups,
         **self.kwargs)

This is a very simple implementation, yet it is effective. Let's compare the two:

>>> sharedconv = SharedKernelConv2d(3, groups=3)

With the other method:

>>> conv = nn.Conv2d(3, 3, 3, groups=3)
>>> conv.weight.data = torch.clone(conv.weight[:1]).repeat(3, 1, 1, 1)
>>> conv.bias.data = torch.clone(conv.bias[:1]).repeat(3)

Backpropagate on the sharedconv layer:

>>> sharedconv(x).mean().backward()

>>> sharedconv.weight.grad
tensor([[[[0.7920, 0.6585, 0.8721],
          [0.6257, 0.3358, 0.6995],
          [0.5230, 0.6542, 0.3852]]]])
>>> sharedconv.bias.grad
tensor([1.])

Compared to summing the gradient on the repeated tensor:

>>> conv(x).mean().backward()

>>> conv.weight.grad.sum(0, keepdim=True)
tensor([[[[0.7920, 0.6585, 0.8721],
          [0.6257, 0.3358, 0.6995],
          [0.5230, 0.6542, 0.3852]]]])
>>> conv.bias.grad.sum(0, keepdim=True)
tensor([1.])

With SharedKernelConv2d, you don't have to worry about summing the kernel gradients yourself after every backward pass: the accumulation happens automatically because Tensor.expand keeps self.weight and self.bias as references to the single underlying parameters.
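For instance, a training step with a plain SGD optimizer (the learning rate and the mean() "loss" below are arbitrary, just for illustration) only ever sees and updates the single weight and bias:

>>> optim = torch.optim.SGD(sharedconv.parameters(), lr=0.1)
>>> optim.zero_grad()
>>> sharedconv(x).mean().backward()   # gradients from all three groups accumulate on the one kernel
>>> optim.step()                      # a single update, applied once to the shared kernel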

Upvotes: 2

S P Sharan

Reputation: 1148

One possible answer is to take a mean after the gradient updates, like so:

kernel.weight.data = conv.weight.data.mean(0).unsqueeze(0)

Is this the best way to do it? Or is this even right in the first place?

Upvotes: 0
