Reputation: 33
I can't seem to find a concrete answer to this question. I am currently doing transfer learning from a VGG19 network, and my target domain is document classification (either by visual classification alone, or by using the CNN's features as input to another model). I want to understand in which cases it is desirable to keep all of the model's fully connected layers, and in which cases I should remove them and attach a new fully connected layer on top of the last convolutional layer. What does each of these choices imply for training, predictions, etc.?
Here are some Keras code examples of what I mean:
Keeping the original head and building on its last fully connected layer (fc2):
from keras.applications.vgg19 import VGG19
from keras.layers import Dense, Dropout, BatchNormalization
from keras.models import Model
from keras import optimizers

def build_features_model(num_classes):
    # Keep the full VGG19, including its fully connected head,
    # and attach the new classifier to the 'fc2' layer's output.
    original_model = VGG19(include_top=True, weights='imagenet', input_shape=(224, 224, 3))
    layer_name = 'fc2'
    x = Dropout(0.5)(original_model.get_layer(layer_name).output)
    x = BatchNormalization()(x)
    predictions = Dense(num_classes, activation='softmax')(x)
    features_model = Model(inputs=original_model.input, outputs=predictions)
    adam = optimizers.Adam(lr=0.001)
    features_model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
    features_model.summary()
    return features_model
Adding one fully connected layer after the last convolutional layer:
from keras.layers import Flatten

def build_head_model(num_classes):
    # Drop the original head (include_top=False) and train a new,
    # randomly initialized fully connected head from scratch.
    base_model = VGG19(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
    x = Flatten()(base_model.output)
    x = Dense(4096, activation='relu')(x)
    x = Dropout(0.5)(x)
    x = BatchNormalization()(x)
    predictions = Dense(num_classes, activation='softmax')(x)
    head_model = Model(inputs=base_model.input, outputs=predictions)
    adam = optimizers.Adam(lr=0.001)
    head_model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
    head_model.summary()
    return head_model
Is there a rule of thumb for what to choose when doing transfer learning?
Upvotes: 3
Views: 975
Reputation: 5015
In my past experience (I successfully applied transfer learning from stock-market data to business forecasting), you should keep the original structure: when doing transfer learning, you want to load the weights trained on the original architecture without running into mismatches between the two networks. You can then unfreeze parts of the CNN, so training starts from high accuracy and simply adapts the weights to the target problem.
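To make that concrete, here is a minimal sketch of what "unfreeze parts of the CNN" can look like in Keras, reusing the build_features_model function from the question above. The choice of block5_conv1 as the cut point, the placeholder num_classes=10, and the smaller learning rate are my assumptions, not fixed rules:

from keras import optimizers

model = build_features_model(num_classes=10)  # num_classes=10 is a placeholder

# Freeze everything first so the pretrained ImageNet weights are preserved.
for layer in model.layers:
    layer.trainable = False

# Unfreeze the last convolutional block and everything above it
# (the original fc layers plus the new head), so only they adapt.
set_trainable = False
for layer in model.layers:
    if layer.name == 'block5_conv1':
        set_trainable = True
    if set_trainable:
        layer.trainable = True

# Recompile so the new trainable flags take effect; a smaller learning
# rate avoids destroying the pretrained weights early in training.
model.compile(optimizer=optimizers.Adam(lr=1e-4),
              loss='categorical_crossentropy',
              metrics=['accuracy'])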
However, if you remove the original fully connected head (everything after the Flatten layer), the computational cost decreases, because those layers hold most of the parameters you would otherwise have to train.
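To put numbers on that, a quick comparison (the counts in the comments are what Keras reports for the standard 224x224 VGG19; treat them as approximate if your setup differs):

from keras.applications.vgg19 import VGG19

with_top = VGG19(include_top=True, weights=None)    # conv blocks + fc1/fc2/predictions
conv_only = VGG19(include_top=False, weights=None)  # conv blocks only

print(with_top.count_params())   # 143,667,240 parameters
print(conv_only.count_params())  # 20,024,384 parameters: the fc head holds ~86% of the total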
I follow the rule of keeping neural nets as simple as possible (simpler models tend to generalize better) while keeping efficiency high.
@Kamen, to complement your comment about how much data you will need: it depends on the variance of your data. The more variance there is, the more layers and weights you need to learn the details. However, as you increase the complexity of the architecture, your neural net becomes more prone to overfitting, which can be reduced with Dropout, for instance.
Since fully connected layers are the most expensive part of a neural net, adding one or two of them increases the parameter count a lot, demanding more time to train. More layers can give you higher accuracy, but they also raise the risk of overfitting.
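For concreteness, the first Dense(4096) in the question's second snippet dominates the whole model on its own, assuming the standard 224x224 input (which makes the last conv output 7x7x512):

# Parameter count of Dense(4096) on the flattened VGG19 conv output.
flat = 7 * 7 * 512                 # 25088 features after Flatten()
dense_params = flat * 4096 + 4096  # weights + biases
print(dense_params)                # 102764544, i.e. ~102.8M parameters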
For instance, MNIST, even with only 10,000 training examples, can reach better than 99% accuracy with a quite simple architecture, whereas ImageNet, with about 1,000,000 examples (155 GB), demands a more complex structure, like VGG16.
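For reference, a sketch of what "quite simple" can mean on MNIST (the layer sizes, dropout rate, and epoch count are my own choices, not a prescription):

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255
y_train, y_test = to_categorical(y_train), to_categorical(y_test)

# Two small conv blocks and one dense layer are enough here.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))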
Upvotes: 1