Reputation: 3
I am new to this forum and I have started studying the theory of CNNs. It is probably a stupid question, but I am confused about how to calculate the shape of a CNN's outputs. I am following a course on Udacity, and in one of the tutorials they provide this CNN architecture.
import torch.nn as nn
import torch.nn.functional as F

# define the CNN architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # convolutional layer (sees 32x32x3 image tensor)
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        # convolutional layer (sees 16x16x16 tensor)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        # convolutional layer (sees 8x8x32 tensor)
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        # max pooling layer
        self.pool = nn.MaxPool2d(2, 2)
        # linear layer (64 * 4 * 4 -> 500)
        self.fc1 = nn.Linear(64 * 4 * 4, 500)
        # linear layer (500 -> 10)
        self.fc2 = nn.Linear(500, 10)
        # dropout layer (p=0.25)
        self.dropout = nn.Dropout(0.25)
Could you please help me understand how they calculate the outputs of the CNN layers? (The starting shape of the images is 32x32x3.) More specifically, how did they end up with this:
# linear layer (64 * 4 * 4 -> 500)
self.fc1 = nn.Linear(64 * 4 * 4, 500)
Thanks a lot
Upvotes: 0
Views: 213
Reputation: 1822
The code is missing the definition of the forward pass, but from the comments one can guess that a 2x2 max pooling is applied after each conv layer. Each pooling step halves the spatial dimensions, so the 32x32 images become 16x16 after conv1 (+ 2x2 pooling), 8x8 after conv2 (+ 2x2 pooling), and 4x4 after conv3 (+ 2x2 pooling). The 3x3 convolutions themselves preserve the spatial size because of padding=1. Since conv3 has 64 filters, it outputs 64 feature maps of size 4x4. fc1 then maps this flattened tensor (64 * 4 * 4 = 1024 values) to a fully connected layer of size 500, which is exactly what is defined by the line
self.fc1 = nn.Linear(64 * 4 * 4, 500)
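To double-check the arithmetic, you can trace the spatial size through each stage with the standard output-size formula, floor((in + 2*pad - kernel)/stride) + 1, in plain Python (no torch needed; the layer names just mirror the ones in the question, and the per-layer kernel/padding values are taken from its code):

```python
# Spatial output size of a conv or pooling layer:
# floor((in + 2*pad - kernel) / stride) + 1
def conv_out(size, kernel=3, stride=1, pad=1):
    # 3x3 conv with padding=1 and stride=1, as in the question
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    # 2x2 max pooling with stride 2 halves the spatial size
    return (size - kernel) // stride + 1

size = 32  # input images are 32x32
for name, channels in [("conv1", 16), ("conv2", 32), ("conv3", 64)]:
    size = conv_out(size)  # conv keeps the spatial size (padding=1)
    size = pool_out(size)  # pooling halves it
    print(f"after {name} + pool: {channels} x {size} x {size}")

flat = 64 * size * size  # features fed into fc1
print("flattened:", flat)  # 64 * 4 * 4 = 1024
```

This prints 16x16 after conv1, 8x8 after conv2, and 4x4 after conv3, confirming the 64 * 4 * 4 input size of fc1.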
Upvotes: 1