Reputation: 147
I am a little confused about the difference between the conv2d and conv3d functions. For example, suppose I have a stack of N images, each with height H, width W, and 3 RGB channels. The input to the network can take two forms. form1: (batch_size, N, H, W, 3), which is a rank-5 tensor. form2: (batch_size, H, W, 3N), which is a rank-4 tensor.
The question is: if I apply conv3d with M filters of size (N,3,3) to form1, and conv2d with M filters of size (3,3) to form2,
do they perform basically the same feature operations? I think both of these forms convolve over the temporal and spatial dimensions.
I would really appreciate it if anyone could help me figure this out.
Upvotes: 8
Views: 17227
Reputation: 1466
If you have a stack of images, you have a video. The two input forms are not interchangeable: you have either images or videos. For the video case you can use 3D convolution; 2D convolution is not defined for it. If you stack the channels as you described (3N), the 2D convolution will interpret the stack as one image with a lot of channels, not as a stack.
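To make the difference concrete, here is a minimal sketch in plain Python that only does the shape bookkeeping for both forms. It assumes stride 1, no padding ("valid"), and channels-last layout; the helper names `conv_output_shape` and `kernel_shape`, the sizes, and the (2, 3, 3) kernel are made up for illustration and are not part of any library.

```python
def conv_output_shape(input_shape, kernel_size, filters):
    """Output shape of a stride-1, unpadded convolution on a channels-last input.

    input_shape: (batch, *spatial_dims, channels)
    kernel_size: one entry per spatial dim
    """
    batch, *spatial, _channels = input_shape
    out_spatial = [s - k + 1 for s, k in zip(spatial, kernel_size)]
    return (batch, *out_spatial, filters)


def kernel_shape(input_shape, kernel_size, filters):
    """Weight shape: (*kernel_size, in_channels, filters)."""
    return (*kernel_size, input_shape[-1], filters)


# Hypothetical sizes, chosen for illustration only.
batch, N, H, W, M = 2, 8, 32, 32, 16

form1 = (batch, N, H, W, 3)    # rank 5: the stack is its own axis
form2 = (batch, H, W, 3 * N)   # rank 4: the stack is folded into the channels

# A 3D kernel of size (2, 3, 3) slides along the stack axis as well as H and W.
out3d = conv_output_shape(form1, (2, 3, 3), M)
w3d = kernel_shape(form1, (2, 3, 3), M)

# A 2D kernel of size (3, 3) covers all 3N channels at once and never slides over them.
out2d = conv_output_shape(form2, (3, 3), M)
w2d = kernel_shape(form2, (3, 3), M)

print(out3d, w3d)  # (2, 7, 30, 30, 16) (2, 3, 3, 3, 16)
print(out2d, w2d)  # (2, 30, 30, 16) (3, 3, 24, 16)
```

The 3D output keeps an axis for position within the stack, while the 2D output has none: the frame order only survives as channel indices.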
Note that a 2D convolution over (batch, H, W, Channels) is the same as a 3D convolution over (batch, H, W, Channels, 1), provided the 3D kernel spans the full Channels axis and no padding is applied along that axis.
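This equivalence can be checked numerically on a small case. The sketch below uses plain Python lists rather than a deep-learning library, handles a single example and a single filter, and assumes stride 1 with no padding. As in most deep-learning libraries, "convolution" here means cross-correlation (the kernel is not flipped). The functions `conv2d_valid` and `conv3d_valid` are hypothetical helpers written for this check.

```python
import random


def conv2d_valid(image, kernel):
    """image[h][w][c], kernel[i][j][c] -> out[h][w] (one filter, stride 1)."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[y + i][x + j][c] * kernel[i][j][c]
                 for i in range(kh) for j in range(kw) for c in range(C))
             for x in range(W - kw + 1)]
            for y in range(H - kh + 1)]


def conv3d_valid(volume, kernel):
    """volume[a][b][c] with one implicit channel, kernel[i][j][k] -> out[a][b][c]."""
    A, B, C = len(volume), len(volume[0]), len(volume[0][0])
    ka, kb, kc = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [[[sum(volume[a + i][b + j][c + k] * kernel[i][j][k]
                  for i in range(ka) for j in range(kb) for k in range(kc))
              for c in range(C - kc + 1)]
             for b in range(B - kb + 1)]
            for a in range(A - ka + 1)]


random.seed(0)
H, W, C = 5, 6, 3
# Integer data so the comparison below is exact.
image = [[[random.randint(-3, 3) for _ in range(C)] for _ in range(W)]
         for _ in range(H)]
kernel = [[[random.randint(-2, 2) for _ in range(C)] for _ in range(3)]
          for _ in range(3)]

out2d = conv2d_valid(image, kernel)
# Same numbers, read as an (H, W, C) volume with a single channel.
out3d = conv3d_valid(image, kernel)

# The kernel spans the whole third axis, so that axis collapses to length 1.
squeezed = [[cell[0] for cell in row] for row in out3d]
matches = squeezed == out2d
print(matches)  # True
```

If the 3D kernel were shorter than the Channels axis, it would slide along that axis and the two results would no longer match.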
Upvotes: 4