Reputation: 5257
I am trying to use a 3D conv on the CIFAR-10 data set (just for fun). I see in the docs that the input is usually a 5D tensor (N, C, D, H, W). Am I really forced to pass 5-dimensional data?
The reason I am skeptical is that a 3D convolution simply means my kernel moves across 3 dimensions/directions. So technically I could have 3D, 4D, 5D, or even 100D tensors and it should all work, as long as the input is at least 3D. Is that not right?
I tried it real quick and it did give an error:
import torch

def conv3d_example():
    N, C, H, W = 1, 3, 7, 7
    img = torch.randn(N, C, H, W)
    ##
    in_channels, out_channels = 1, 4
    kernel_size = (2, 3, 3)
    conv = torch.nn.Conv3d(in_channels, out_channels, kernel_size)
    ##
    out = conv(img)
    print(out)
    print(out.size())

##
conv3d_example()
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-3-29c73923cc64> in <module>
15
16 ##
---> 17 conv3d_example()
<ipython-input-3-29c73923cc64> in conv3d_example()
10 conv = torch.nn.Conv3d(in_channels, out_channels, kernel_size)
11 ##
---> 12 out = conv(img)
13 print(out)
14 print(out.size())
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
--> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
474 self.dilation, self.groups)
475 return F.conv3d(input, self.weight, self.bias, self.stride,
--> 476 self.padding, self.dilation, self.groups)
477
478
RuntimeError: Expected 5-dimensional input for 5-dimensional weight 4 1 2 3 3, but got 4-dimensional input of size [1, 3, 7, 7] instead
cross posted:
Upvotes: 1
Views: 4492
Reputation: 26004
Let's review what we know. For a 3D convolution we will need to address these:
N: the mini-batch size (how many sequences we want to feed at one go)
Cin: the number of channels in our input (if our image is RGB, this is 3)
D: the depth, or in other words the number of images/frames in one input sequence (if we are dealing with videos, this is the number of frames)
H: the height of the image/frame
W: the width of the image/frame
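For concreteness, a minimal sketch of such a 5D tensor (the shapes here are arbitrary, just for illustration):

import torch

# Hypothetical batch: 2 clips, 3 color channels (RGB), 16 frames, 64x64 pixels.
# The layout is (N, C, D, H, W), in exactly the order listed above.
clips = torch.randn(2, 3, 16, 64, 64)
print(clips.size())  # torch.Size([2, 3, 16, 64, 64])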
So now that we know what's needed, it should be easy to get this going.
The tricky part in your snippet is in_channels: it's C (in your case 3, as you have an RGB image it seems), not 1. The depth of your kernel can be any n with 1 <= n <= D; it goes in the first entry of kernel_size.

def conv3d_example():
    # for deterministic output only
    torch.random.manual_seed(0)
    N, C, D, H, W = 1, 3, 1, 7, 7
    img = torch.randn(N, C, D, H, W)
    ##
    in_channels = C
    out_channels = 4
    kernel_size = (1, 3, 3)
    conv = torch.nn.Conv3d(in_channels, out_channels, kernel_size)
    ##
    out = conv(img)
    print(out)
    print(out.size())
results in:
In [3]: conv3d_example()
tensor([[[[[ 0.9368, -0.6973, 0.1359, 0.2023, -0.3149],
[-0.4601, 0.2668, 0.3414, 0.6624, -0.6251],
[-1.0212, -0.0767, 0.2693, 0.9537, -0.4375],
[ 0.6981, -0.1586, -0.3076, 0.1973, -0.2972],
[-0.0747, -0.8704, 0.1757, -0.4161, -0.3464]]],
[[[-0.4710, -0.7841, -1.1406, -0.6413, 0.9183],
[-0.2473, 0.2532, -1.0443, -0.8634, -0.8797],
[ 0.5243, -0.4383, 0.1375, -0.7561, 0.7913],
[-1.1216, -0.4496, 0.5481, 0.1034, -1.0036],
[-0.0941, -0.1458, -0.1438, -1.0257, -0.4392]]],
[[[ 0.5196, 0.3102, 0.5299, -0.0126, 0.7945],
[ 0.3721, -1.3339, -0.5849, -0.2701, 0.4842],
[-0.2661, 0.9777, -0.3328, -0.1730, -0.6360],
[ 0.4960, 0.2348, 0.5183, -0.2935, 0.1777],
[-0.2672, 0.0233, -0.5573, 0.8366, 0.6082]]],
[[[-0.1565, -1.7331, -0.2015, -1.1708, 0.3099],
[-0.3667, 0.1985, -0.4940, 0.4044, -0.8000],
[ 0.2814, -0.6172, -0.4466, -0.6098, 0.0983],
[-0.5814, -0.2825, -0.1321, 0.5536, -0.4767],
[-0.3337, 0.3160, -0.4748, -0.7694, -0.0705]]]]],
grad_fn=<SlowConv3DBackward0>)
torch.Size([1, 4, 1, 5, 5])
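As a side note, since the question is about CIFAR-10: if you start from a standard 4D image batch, one minimal way to get the expected 5D layout is to unsqueeze a singleton depth dimension (the shapes below are just for illustration):

import torch

# CIFAR-10-style batch of 8 RGB images: (N, C, H, W)
imgs = torch.randn(8, 3, 32, 32)

# Insert a singleton depth dimension at position 2: (N, C, D=1, H, W)
imgs_5d = imgs.unsqueeze(2)

conv = torch.nn.Conv3d(in_channels=3, out_channels=4, kernel_size=(1, 3, 3))
out = conv(imgs_5d)
print(out.size())  # torch.Size([8, 4, 1, 30, 30])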
Upvotes: 1
Reputation: 22304
Consider the following scenario. You have a 3-channel NxN image. This image will have a size of 3xNxN in PyTorch (ignoring the batch dimension for now).
Say you pass this image to a 2D convolution layer with no bias, a kernel size of 5x5, padding of 2, and input/output channels of 3 and 10 respectively.
What's actually happening when we apply this layer to the input image?
You can think of it like this...
For each of the 10 output channels there is a kernel of size 3x5x5. A 3D convolution is applied to the 3xNxN input image using this kernel, which can be thought of as unpadded in the first dimension. The result of this convolution is a 1xNxN feature map.
Since there are 10 output channels, there are 10 of the 3x5x5 kernels. After all kernels have been applied, the outputs are stacked into a single 10xNxN tensor.
So really, in the classical sense, a 2D convolution layer is already performing a 3D convolution.
Similarly, a 3D convolution layer is really doing a 4D convolution, which is why you need 5-dimensional input.
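You can check this equivalence numerically. Here is a small sketch (the exact shapes are arbitrary choices for the demo): copy the 2D layer's 10x3x5x5 weight into a Conv3d with a single input channel, and the two layers produce the same output.

import torch

torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8)  # one 3-channel 8x8 image

# 2D conv: 3 -> 10 channels, 5x5 kernel, padding 2, no bias
conv2d = torch.nn.Conv2d(3, 10, kernel_size=5, padding=2, bias=False)

# Equivalent 3D conv: treat the channel axis as depth, so in_channels=1.
# Weight shapes: conv2d is (10, 3, 5, 5); conv3d is (10, 1, 3, 5, 5).
conv3d = torch.nn.Conv3d(1, 10, kernel_size=(3, 5, 5), padding=(0, 2, 2), bias=False)
with torch.no_grad():
    conv3d.weight.copy_(conv2d.weight.unsqueeze(1))

out2d = conv2d(x)                          # (1, 10, 8, 8)
out3d = conv3d(x.unsqueeze(1)).squeeze(2)  # depth collapses to 1, then is squeezed away
print(torch.allclose(out2d, out3d, atol=1e-5))  # True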
Upvotes: 2