Reputation: 5257
I am trying to use a 3D conv on the CIFAR-10 data set (just for fun). I see in the docs that the input is usually a 5D tensor (N, C, D, H, W). Am I really forced to pass 5-dimensional data?
The reason I am skeptical is that a 3D convolution simply means my kernel moves across 3 dimensions/directions. So technically I could have 3D, 4D, 5D, or even 100D tensors and it should all work, as long as the input is at least 3D. Is that not right?
I tried it real quick and it did give an error:
import torch

def conv3d_example():
    N, C, H, W = 1, 3, 7, 7
    img = torch.randn(N, C, H, W)
    ##
    in_channels, out_channels = 1, 4
    kernel_size = (2, 3, 3)
    conv = torch.nn.Conv3d(in_channels, out_channels, kernel_size)
    ##
    out = conv(img)
    print(out)
    print(out.size())

##
conv3d_example()
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-3-29c73923cc64> in <module>
15
16 ##
---> 17 conv3d_example()
<ipython-input-3-29c73923cc64> in conv3d_example()
10 conv = torch.nn.Conv3d(in_channels, out_channels, kernel_size)
11 ##
---> 12 out = conv(img)
13 print(out)
14 print(out.size())
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
--> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
474 self.dilation, self.groups)
475 return F.conv3d(input, self.weight, self.bias, self.stride,
--> 476 self.padding, self.dilation, self.groups)
477
478
RuntimeError: Expected 5-dimensional input for 5-dimensional weight 4 1 2 3 3, but got 4-dimensional input of size [1, 3, 7, 7] instead
cross posted:
Upvotes: 1
Views: 4492
Reputation: 26004
Let's review what we know. For a 3D convolution we will need to address these:
N: the mini-batch size (how many sequences we want to feed at one go)
Cin: the number of channels in our input (if our image is RGB, this is 3)
D: the depth, or in other words the number of images/frames in one input sequence (if we are dealing with videos, this is the number of frames)
H: the height of the image/frame
W: the width of the image/frame
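For concreteness, a minimal sketch of such a 5D tensor (the shapes here are arbitrary, just for illustration):

import torch

# Hypothetical batch: 2 clips, 3 color channels (RGB), 16 frames, 64x64 pixels.
# The layout is (N, C, D, H, W), in exactly the order listed above.
clips = torch.randn(2, 3, 16, 64, 64)
print(clips.size())  # torch.Size([2, 3, 16, 64, 64])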
So now that we know what's needed, it should be easy to get this going.
The tricky part in your snippet is in_channels: it's C (in your case 3, as you have an RGB image it seems), not 1. The depth of your kernel can be any n with 1 <= n <= D; it goes in the first entry of kernel_size.

def conv3d_example():
    # for deterministic output only
    torch.random.manual_seed(0)
    N, C, D, H, W = 1, 3, 1, 7, 7
    img = torch.randn(N, C, D, H, W)
    ##
    in_channels = C
    out_channels = 4
    kernel_size = (1, 3, 3)
    conv = torch.nn.Conv3d(in_channels, out_channels, kernel_size)
    ##
    out = conv(img)
    print(out)
    print(out.size())
results in:
In [3]: conv3d_example()
tensor([[[[[ 0.9368, -0.6973, 0.1359, 0.2023, -0.3149],
[-0.4601, 0.2668, 0.3414, 0.6624, -0.6251],
[-1.0212, -0.0767, 0.2693, 0.9537, -0.4375],
[ 0.6981, -0.1586, -0.3076, 0.1973, -0.2972],
[-0.0747, -0.8704, 0.1757, -0.4161, -0.3464]]],
[[[-0.4710, -0.7841, -1.1406, -0.6413, 0.9183],
[-0.2473, 0.2532, -1.0443, -0.8634, -0.8797],
[ 0.5243, -0.4383, 0.1375, -0.7561, 0.7913],
[-1.1216, -0.4496, 0.5481, 0.1034, -1.0036],
[-0.0941, -0.1458, -0.1438, -1.0257, -0.4392]]],
[[[ 0.5196, 0.3102, 0.5299, -0.0126, 0.7945],
[ 0.3721, -1.3339, -0.5849, -0.2701, 0.4842],
[-0.2661, 0.9777, -0.3328, -0.1730, -0.6360],
[ 0.4960, 0.2348, 0.5183, -0.2935, 0.1777],
[-0.2672, 0.0233, -0.5573, 0.8366, 0.6082]]],
[[[-0.1565, -1.7331, -0.2015, -1.1708, 0.3099],
[-0.3667, 0.1985, -0.4940, 0.4044, -0.8000],
[ 0.2814, -0.6172, -0.4466, -0.6098, 0.0983],
[-0.5814, -0.2825, -0.1321, 0.5536, -0.4767],
[-0.3337, 0.3160, -0.4748, -0.7694, -0.0705]]]]],
grad_fn=<SlowConv3DBackward0>)
torch.Size([1, 4, 1, 5, 5])
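As a side note, since the question is about CIFAR-10: if you start from a standard 4D image batch, one minimal way to get the expected 5D layout is to unsqueeze a singleton depth dimension (the shapes below are just for illustration):

import torch

# CIFAR-10-style batch of 8 RGB images: (N, C, H, W)
imgs = torch.randn(8, 3, 32, 32)

# Insert a singleton depth dimension at position 2: (N, C, D=1, H, W)
imgs_5d = imgs.unsqueeze(2)

conv = torch.nn.Conv3d(in_channels=3, out_channels=4, kernel_size=(1, 3, 3))
out = conv(imgs_5d)
print(out.size())  # torch.Size([8, 4, 1, 30, 30])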
Upvotes: 1
Reputation: 22304
Consider the following scenario. You have a 3-channel NxN image. This image will have a size of 3xNxN in PyTorch (ignoring the batch dimension for now).
Say you pass this image to a 2D convolution layer with no bias, a kernel size of 5x5, padding of 2, and input/output channels of 3 and 10 respectively.
What's actually happening when we apply this layer to the input image?
You can think of it like this...
For each of the 10 output channels there is a kernel of size 3x5x5. A 3D convolution is applied to the 3xNxN input image using this kernel, which can be thought of as unpadded in the first dimension. The result of this convolution is a 1xNxN feature map.
Since there are 10 output channels, there are 10 of the 3x5x5 kernels. After all kernels have been applied, the outputs are stacked into a single 10xNxN tensor.
So really, in the classical sense, a 2D convolution layer is already performing a 3D convolution.
Similarly, a 3D convolution layer is really doing a 4D convolution, which is why you need 5-dimensional input.
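You can check this equivalence numerically. Here is a small sketch (the exact shapes are arbitrary choices for the demo): copy the 2D layer's 10x3x5x5 weight into a Conv3d with a single input channel, and the two layers produce the same output.

import torch

torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8)  # one 3-channel 8x8 image

# 2D conv: 3 -> 10 channels, 5x5 kernel, padding 2, no bias
conv2d = torch.nn.Conv2d(3, 10, kernel_size=5, padding=2, bias=False)

# Equivalent 3D conv: treat the channel axis as depth, so in_channels=1.
# Weight shapes: conv2d is (10, 3, 5, 5); conv3d is (10, 1, 3, 5, 5).
conv3d = torch.nn.Conv3d(1, 10, kernel_size=(3, 5, 5), padding=(0, 2, 2), bias=False)
with torch.no_grad():
    conv3d.weight.copy_(conv2d.weight.unsqueeze(1))

out2d = conv2d(x)                          # (1, 10, 8, 8)
out3d = conv3d(x.unsqueeze(1)).squeeze(2)  # depth collapses to 1, then is squeezed away
print(torch.allclose(out2d, out3d, atol=1e-5))  # True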
Upvotes: 2