Sergii Gryshkevych

Reputation: 4159

Tensorflow: weight decay vs logits normalization

I am working on a CNN with an architecture similar to the cifar10 example. My CNN consists of two convolutional layers, each followed by a max-pooling layer, and three fully connected layers. All layers except the final one use the relu activation function. The final layer yields logits on the order of 10^5, so running the softmax function results in one-hot encoding. I tried to solve this problem in two different ways.

Firstly, I simply rescaled the logits to [-1, 1], i.e. normalized them. This seems to solve the problem: training goes fine and the CNN produces reasonable results. But I am not sure if it is the right way to go; it feels like normalizing the logits is a workaround that does not solve the initial problem.
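For illustration, the rescaling can be done along these lines (a rough sketch of a per-example max-abs scaling, assuming logits is the output of the final fully connected layer; not my exact code):

# Rescale each example's logits into [-1, 1] by its maximum absolute value.
max_abs = tf.reduce_max(tf.abs(logits), axis=1, keep_dims=True)
scaled_logits = logits / (max_abs + 1e-12)  # small epsilon guards against all-zero rows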

Secondly, I applied weight decay. Without logits normalization, the cross-entropy starts from large values but goes down steadily, as does the total weight loss. However, the accuracy follows a weird pattern. Moreover, the network trained with weight decay produces much worse results than the one with logits normalization.

Weight decay is added as follows:

def weight_decay(var, wd):
    # Scale the L2 norm of the variable by the decay factor wd and
    # collect it so it can later be summed into the total loss.
    weight_loss = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
    tf.add_to_collection('weight_losses', weight_loss)
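
The collected terms are then added to the cross-entropy, roughly like this (a sketch; cross_entropy_mean is a placeholder name for the averaged data loss):

# Sum the data loss and all collected L2 penalties into the total loss.
weight_losses = tf.get_collection('weight_losses')
total_loss = tf.add_n([cross_entropy_mean] + weight_losses, name='total_loss')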

Accuracy is computed as follows:

predictions = tf.nn.softmax(logits)
# argmax over the class dimension gives the predicted class index
predicted_class = tf.argmax(predictions, 1)
correct_pred = tf.equal(predicted_class, tf.argmax(y, 1))
accuracy_batch = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

So how should one address the problem of large-magnitude logits? What is the best practice? And what could be the reason that applying weight decay produces worse results?

Upvotes: 3

Views: 1541

Answers (1)

dkk

Reputation: 31

In my opinion, the issue is that tf.nn.l2_loss(var) is not normalized: it returns the raw sum of squares (divided by two), so its magnitude depends on the size of the layer. In other words, your wd may be too big, or otherwise not suitable.
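For example (a small standalone sketch, not taken from your code):

import tensorflow as tf

w = tf.constant([[3.0, 4.0]])
raw_penalty = tf.nn.l2_loss(w)                          # sum(w ** 2) / 2 = 12.5
scaled = raw_penalty / tf.cast(tf.size(w), tf.float32)  # one way to account for layer size

with tf.Session() as sess:
    print(sess.run([raw_penalty, scaled]))              # prints [12.5, 6.25]

Since the raw penalty grows with the number of parameters, a fixed wd that works for a small convolutional kernel can let the penalty dominate the loss for a large fully connected layer.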

Upvotes: -1
