Reputation: 4159
I am working on a CNN with an architecture similar to the cifar10 example. It consists of two convolutional layers, each followed by a max-pooling layer, and three fully connected layers. All layers except the final one use the relu activation function. The final layer yields logits on the order of 10^5, so running the softmax function on them produces an essentially one-hot encoding. I tried to solve this problem in two different ways.
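For reference, the saturation is easy to reproduce: when the logits are this large, the biggest entry dominates the exponentials and the softmax output is effectively one-hot. A minimal NumPy illustration with made-up values:
import numpy as np
logits = np.array([1.0e5, 0.98e5, 0.95e5])       # made-up logits of order 10^5
shifted = logits - logits.max()                   # subtract the max for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum()   # softmax
print(probs)                                      # -> [1. 0. 0.], effectively one-hot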
Firstly, I simply rescaled the logits to [-1, 1], i.e. normalized them. This seems to solve the problem: training goes fine and the CNN produces reasonable results. But I am not sure it is the right way to go; it feels like normalizing the logits is a workaround that does not solve the underlying problem.
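For concreteness, the rescaling looks roughly like this (a sketch rather than the exact code; the small epsilon is only there to avoid division by zero):
# Sketch of the rescaling workaround: divide the logits by their largest
# absolute value so they end up in [-1, 1] before the softmax.
max_abs = tf.reduce_max(tf.abs(logits)) + 1e-8
scaled_logits = logits / max_abs
predictions = tf.nn.softmax(scaled_logits)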
Secondly, I applied weight decay. Without logit normalization the cross-entropy starts from large values, but it goes down steadily along with the total weight loss. However, the accuracy follows a weird pattern. Moreover, the network trained with weight decay produces much worse results than the one with logit normalization.
Weight decay is added as follows:
def weight_decay(var, wd):
    # L2 penalty for a single variable, scaled by the decay factor wd
    weight_decay = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
    # collect the term so it can be summed into the total loss later
    tf.add_to_collection('weight_losses', weight_decay)
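The collected terms are then summed into the training loss, roughly like this (a sketch assuming the data loss tensor is called cross_entropy_mean):
# Sketch: sum all collected weight losses and add them to the data loss
# (cross_entropy_mean is assumed to be the cross-entropy tensor).
total_weight_loss = tf.add_n(tf.get_collection('weight_losses'))
total_loss = cross_entropy_mean + total_weight_loss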
Accuracy is computed as follows:
predictions = tf.nn.softmax(logits)
# argmax returns the predicted class index (not a one-hot vector) for each example
one_hot_pred = tf.argmax(predictions, 1)
correct_pred = tf.equal(one_hot_pred, tf.argmax(y, 1))
accuracy_batch = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
So how should one address the problem of large-magnitude logits? What is the best practice? And what could be the reason that applying weight decay produces worse results?
Upvotes: 3
Views: 1541
Reputation: 31
In my opinion, the issue is that tf.nn.l2_loss(var) is not normalized; in other words, maybe your wd is too big, or otherwise not suitable.
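For example, tf.nn.l2_loss(var) returns sum(var ** 2) / 2 over all elements, so the penalty grows with the number of weights in a layer; a wd that is fine for a small layer can dominate the cross-entropy for a big fully connected layer. One possible fix is to normalize by the element count (a sketch, not tested on your model):
# Sketch: divide the L2 penalty by the number of elements so that wd has
# a comparable effect on layers of different sizes.
def normalized_weight_decay(var, wd):
    n = tf.cast(tf.size(var), tf.float32)
    return tf.mul(tf.nn.l2_loss(var) / n, wd, name='weight_loss')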
Upvotes: -1