TensorFlow SGD decay parameter

Question

I am using TensorFlow 2.4.1 and Python 3.8 to train CNN models for computer vision, such as VGG-18 and ResNet-18/34. My question is about how weight decay is declared. There appear to be two ways of defining it:

The first is to declare it per layer, using the 'kernel_regularizer' parameter of the 'Conv2D' layer.
The second is to use the 'decay' parameter of the TF SGD optimizer.

Example code:

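import tensorflow as tf
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Conv2D

weight_decay = 0.0005

# Per-layer L2 penalty on the convolution kernel
Conv2D(
    filters=64, kernel_size=(3, 3),
    activation='relu', kernel_initializer=tf.initializers.he_normal(),
    strides=(1, 1), padding='same',
    kernel_regularizer=regularizers.l2(weight_decay),
)
# NOTE: this 'kernel_regularizer' parameter is used for all of the conv layers in the ResNet-18/34 and VGG-18 models

# 'lr_decay' is defined elsewhere in my code; its value is not shown here
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, decay=lr_decay, momentum=0.9)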

My questions are:

Do these two techniques apply the same weight decay? If so, only one of them should be used, to avoid redundancy.
If not, does using both definitions apply weight decay twice? Too much regularization would push even the helpful weights towards zero, so the model would fail to learn the desired function.
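
For reference, here is a minimal, dependency-free sketch of how the two settings could be compared numerically. It rests on two assumptions about TF 2.x Keras that are worth checking against the documentation: that 'regularizers.l2(c)' adds c * sum(w^2) to the loss, and that the legacy 'decay' argument of SGD rescales the learning rate as lr / (1 + decay * iterations). The helper names and the value of 'lr_decay' are hypothetical and are not taken from my actual code.

```python
# Plain-Python sketch; the function names below are hypothetical helpers,
# not TensorFlow APIs.

def l2_penalty(weights, coeff):
    """Extra loss term assumed for an L2 kernel regularizer: coeff * sum(w^2)."""
    return coeff * sum(w * w for w in weights)


def l2_penalty_grad(weights, coeff):
    """Gradient of that penalty with respect to each weight: 2 * coeff * w."""
    return [2.0 * coeff * w for w in weights]


def inverse_time_lr(initial_lr, decay, iterations):
    """Learning rate assumed for the legacy 'decay' argument: lr / (1 + decay * t)."""
    return initial_lr / (1.0 + decay * iterations)


weights = [0.5, -1.0, 2.0]   # toy kernel values
weight_decay = 0.0005        # value from the question
lr_decay = 1e-4              # assumed value; the question does not state it

penalty = l2_penalty(weights, weight_decay)
grads = l2_penalty_grad(weights, weight_decay)
lr_start = inverse_time_lr(0.01, lr_decay, 0)
lr_later = inverse_time_lr(0.01, lr_decay, 10000)

print("L2 penalty added to the loss:", penalty)
print("Penalty gradient per weight:", grads)
print("Learning rate at step 0 and step 10000:", lr_start, lr_later)
```

If these assumptions hold, the first setting changes the loss (and so pulls every kernel weight towards zero), while the second only shrinks the step size over time and does not depend on the weights at all.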
