Reputation: 539
I'm trying to build a simple multilayer perceptron model on a large data set, but the loss comes out as NaN. The weird thing is that after the first training step the loss is not NaN; it is about 46 (which is oddly low: when I run a logistic regression model, the first loss value is about 3600). Right after that, though, the loss is NaN on every step. I also used tf.Print to try to debug it.
The goal of the model is to predict ~4500 different classes, so it's a classification problem. With tf.Print, I see that after the first training step (or feed forward through the MLP), the predictions coming out of the last fully connected layer seem right (all varying numbers between 1 and 4500). After that, however, the outputs from the last fully connected layer become either all 0's or some other constant number (0 0 0 0 0).
For some information about my model:
3 layer model. all fully connected layers.
batch size of 1000
learning rate of .001 (I also tried .1 and .01, but nothing changed)
using CrossEntropyLoss (I did add an epsilon value to prevent log(0))
using AdamOptimizer
learning rate decay is .95
The exact code for the model is below: (I'm using the TF-Slim library)
input_layer = slim.fully_connected(model_input, 5000, activation_fn=tf.nn.relu)
hidden_layer = slim.fully_connected(input_layer, 5000, activation_fn=tf.nn.relu)
output = slim.fully_connected(hidden_layer, vocab_size, activation_fn=tf.nn.relu)
output = tf.Print(output, [tf.argmax(output, 1)], 'out = ', summarize = 20, first_n = 10)
return {"predictions": output}
Any help would be greatly appreciated! Thank you so much!
Upvotes: 2
Views: 7186
Reputation: 214
From my understanding, ReLU has no upper bound on its output, so activations can grow without limit and training is more likely to diverge, depending on how the network is set up.
Try switching all the activation functions to tanh or sigmoid, which are bounded. ReLU is most often seen in the convolutional layers of CNNs.
It's also difficult to tell whether you're diverging because of the cross entropy, since we don't know how you changed it with your epsilon value. Try just using the residual (the difference between prediction and target) as the loss; it's much simpler but still effective.
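If you do stay with cross entropy, one way to avoid needing an epsilon at all is to compute the loss directly from the raw outputs (logits) with the log-sum-exp trick, instead of taking a softmax first and a log afterwards. The following is only a minimal sketch in plain Python, not your loss and not TensorFlow code; the function names are made up for illustration. TensorFlow's tf.nn.softmax_cross_entropy_with_logits takes raw logits in the same spirit.

```python
import math

def log_softmax(logits):
    """Log-softmax via log-sum-exp: subtract the max before exponentiating."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def cross_entropy_from_logits(logits, label):
    """Cross entropy for one example; `label` is the integer class index."""
    return -log_softmax(logits)[label]

# Extreme logits: a naive softmax would overflow in exp(1000.0),
# but the shifted version stays finite.
logits = [1000.0, 0.0, -1000.0]
loss = cross_entropy_from_logits(logits, label=1)

# Sanity check: with all-equal logits over 4500 classes the loss is log(4500).
uniform_loss = cross_entropy_from_logits([0.0] * 4500, label=0)
```

Because the largest logit is shifted to zero before exponentiating, no exponential can overflow and the log never sees zero, so no epsilon is required.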
Also, a 5000-5000-4500 neural network is huge. It's unlikely you actually need a network that large.
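To put a rough number on "huge", here is the parameter count for the last two layers only. This is back-of-the-envelope arithmetic under assumptions: the input width isn't given in the question, so the first layer is left out, and vocab_size is taken as exactly 4500 even though the question says "~4500".

```python
# Assumed sizes; vocab_size = 4500 is an approximation of the "~4500" classes.
hidden_units = 5000
vocab_size = 4500

# Each fully connected layer has (inputs * outputs) weights plus one bias per output.
hidden_to_hidden = hidden_units * hidden_units + hidden_units
hidden_to_output = hidden_units * vocab_size + vocab_size
total_params = hidden_to_hidden + hidden_to_output

# float32 uses 4 bytes per parameter.
total_megabytes = total_params * 4 / 1e6
```

That is about 47.5 million parameters (roughly 190 MB in float32) before the first layer is even counted.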
Upvotes: 0
Reputation: 256
Two (possibly more) reasons why it doesn't work:
Upvotes: 3
Reputation: 659
For some reason, your training process has diverged, and you may have infinite values in your weights, which gives NaN losses. There can be many causes; try changing your training parameters (for example, use smaller batches as a test).
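To find out where the values first blow up, you can scan whatever numbers you are able to pull out of the model (weights, activations, per-step losses) for inf or NaN. A minimal sketch in plain Python, with a hypothetical helper name; inside a TensorFlow graph, tf.check_numerics does a similar check on a tensor.

```python
import math

def first_non_finite(values):
    """Return the index of the first inf/NaN entry, or None if all are finite."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index
    return None

# Made-up example values: the second entry has overflowed to infinity.
weights = [0.5, float("inf"), float("nan")]
bad_index = first_non_finite(weights)
```

Running a check like this after each training step shows which step, and which quantity, goes wrong first.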
Also, using a ReLU for the final output of a classifier is unusual; try using a sigmoid.
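Here is a small illustration of why a ReLU on the last layer can get stuck, which would be consistent with the constant "0 0 0 0 0" output in the question. It is a sketch with made-up numbers, not a trace of the actual model: once every pre-activation of the output layer is negative, the ReLU outputs are all zero, argmax always picks index 0, and no gradient flows back through the layer.

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def relu_grad(xs):
    """Derivative of ReLU: 1 where the input is positive, 0 elsewhere."""
    return [1.0 if x > 0 else 0.0 for x in xs]

def argmax(xs):
    """Index of the largest value; ties resolve to the first index."""
    return max(range(len(xs)), key=lambda i: xs[i])

# Hypothetical pre-activations of the output layer after one bad update.
pre_activations = [-3.2, -0.7, -5.1, -1.4]
outputs = relu(pre_activations)
prediction = argmax(outputs)

# Whatever gradient arrives from the loss is multiplied by the ReLU derivative.
upstream_grad = [0.25, -0.75, 0.25, 0.25]
downstream_grad = [g * d for g, d in zip(upstream_grad, relu_grad(pre_activations))]
```

Every entry of `outputs` and `downstream_grad` is zero, so the layer cannot recover on its own.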
Upvotes: 0