Reputation: 121
I've been going through Chollet's Deep Learning with Python, where he briefly covers L2-normalization in the context of Keras. I understand that it prevents overfitting by adding a penalty proportional to the sum of the squared weights to the cost function of the layer, helping to keep the weights small.
However, in the section covering artistic style transfer, the content loss as a measure is described as:
the L2 norm between the activations of an upper layer in a pretrained convnet, computed over the target image, and the activations of the same layer computed over the generated image. This guarantees that, as seen from the upper layer, the generated image will look similar.
The style loss is also related to the L2-norm, but let's focus on the content loss for now.
So, the relevant code snippet (p.292):
def content_loss(base, combination):
    return K.sum(K.square(combination - base))

outputs_dict = dict([(layer.name, layer.output) for layer in model.layers])

content_layer = 'block5_conv2'
style_layers = ['block1_conv1',
                'block2_conv1',
                'block3_conv1',
                'block4_conv1',
                'block5_conv1']

total_variation_weight = 1e-4
style_weight = 1.
content_weight = 0.025

# K here refers to the Keras backend
loss = K.variable(0.)
layer_features = outputs_dict[content_layer]
target_image_features = layer_features[0, :, :, :]
combination_features = layer_features[2, :, :, :]
loss += content_weight * content_loss(target_image_features,
                                      combination_features)
I don't understand why we use the outputs of each layer, which are image feature maps, rather than fetching the weights with Keras's get_weights() method to perform the normalization. I don't follow how applying the L2-norm to these feature maps acts as a penalty during training, or, for that matter, what exactly it is penalizing.
Upvotes: 1
Views: 435
Reputation: 33420
I understand that it prevents overfitting by adding a penalty proportional to the sum of the squared weights to the cost function of the layer, helping to keep the weights small.
What you are referring to is (weight) regularization, in this case L2-regularization. The (squared) L2-norm of a vector is the sum of the squares of its elements, so when you apply L2-regularization to the weights (i.e. parameters) of a layer, that quantity is added to the loss function. Since we are minimizing the loss function, a side effect is that the L2-norm of the weights is reduced as well, which in turn means the weights themselves are pushed toward small values.
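For instance, in Keras you would typically attach this penalty to a layer via a kernel_regularizer (a minimal sketch, not from the book; the layer sizes here are arbitrary):

from keras import layers, models, regularizers

# L2 weight regularization: the term 0.001 * sum(W**2) for this
# layer's kernel is added to the total loss that the optimizer minimizes.
model = models.Sequential([
    layers.Dense(16, activation='relu', input_shape=(64,),
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy')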
However, in the style transfer example the content loss is defined as the L2-norm (or, more precisely, the L2-loss in this case) of the difference between the activations (not the weights) of a specific layer (i.e. content_layer) computed on the target image and on the combination image (i.e. target image + style):
return K.sum(K.square(combination - base))  # the sum of squared differences, i.e. the (squared) L2-norm of the difference
So no weight regularization is involved here. Rather, the loss function used is this L2-norm, and it serves as a measure of the similarity of two arrays (i.e. the activations of the content layer): the smaller the L2-norm, the more similar the activations.
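You can convince yourself of this with plain NumPy (a toy illustration, not from the book): identical arrays give a loss of 0, and the loss grows as the arrays diverge.

import numpy as np

def content_loss(base, combination):
    # same formula as in the book, written with NumPy
    return np.sum(np.square(combination - base))

a = np.array([1.0, 2.0, 3.0])
print(content_loss(a, a))                           # 0.0  -> identical activations
print(content_loss(a, a + 0.1))                     # 0.03 -> very similar
print(content_loss(a, np.array([5.0, -1.0, 0.0])))  # 34.0 -> very different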
Why the activations of the layer and not its weights? Because we want to make sure that the contents (i.e. the representations produced by content_layer) of the target image and the combination image are similar. Note that the weights of a layer are fixed and do not change with respect to an input image (after training, of course); rather, they are used to describe or represent a specific input image, and that representation is called the activations of that layer for that specific image.
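To make the distinction concrete, here is a rough sketch (my own variable names, not Chollet's; it assumes the Keras 2 setup used in the book) showing that get_weights() returns the same arrays regardless of the input, whereas the activations have to be computed for a particular image:

import numpy as np
from keras.applications import vgg19
from keras import backend as K

model = vgg19.VGG19(weights='imagenet', include_top=False)
layer = model.get_layer('block5_conv2')

# Weights: fixed after training, independent of any input image.
kernel, bias = layer.get_weights()
print(kernel.shape)        # (3, 3, 512, 512)

# Activations: depend on the image you feed through the network.
get_activations = K.function([model.input], [layer.output])
image = np.random.random((1, 224, 224, 3))  # placeholder for a real preprocessed image
activations = get_activations([image])[0]
print(activations.shape)   # (1, 14, 14, 512)

The content loss compares two such activation tensors (one for the target image, one for the combination image); the weights never enter that comparison directly.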
Upvotes: 2