V.Vocor

Reputation: 449

Tensorflow: NaN for custom softmax

Simply exchanging the tf.nn.softmax function for a combination that uses tf.exp, keeping everything else as it was, causes not only the gradients but also the intermediate variable s to contain NaN. I have no idea why this happens.

tempX = x
tempW = W
tempMult = tf.matmul(tempX, W)
s = tempMult + b

#! ----------------------------
#p = tf.nn.softmax(s)
p = tf.exp(s) / tf.reduce_sum(tf.exp(s), axis=1)
#!------------------------------


myTemp = y*tf.log(p)
cost = tf.reduce_mean(-tf.reduce_sum(myTemp, reduction_indices=1)) + mylambda*tf.reduce_sum(tf.multiply(W,W))

grad_W, grad_b = tf.gradients(xs=[W, b], ys=cost)

new_W = W.assign(W - tf.multiply(learning_rate, grad_W))
new_b = b.assign(b - tf.multiply(learning_rate, grad_b))

Upvotes: 1

Views: 1683

Answers (1)

Anton Panchishin

Reputation: 3773

Answer

tf.exp(s) easily overflows for large s. That's the main reason tf.nn.softmax doesn't use that equation directly but computes something equivalent to it (according to the docs).
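To see the overflow concretely, here is a small numpy sketch (not the exact implementation of tf.nn.softmax, just the standard max-subtraction trick a stable softmax relies on). Shifting the logits so the row maximum becomes 0 keeps exp in range without changing the result.

import numpy as np

s = np.array([[1000., 2000., 3000.]], dtype=np.float32)

# naive form: exp() overflows float32, so the division becomes inf/inf = nan
naive = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)

# stable form: subtract the row max first; softmax is invariant to this shift
shifted = np.exp(s - s.max(axis=1, keepdims=True))
stable = shifted / shifted.sum(axis=1, keepdims=True)

print(naive)    # all nan
print(stable)   # [[0. 0. 1.]]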

Discussion

When I rewrote your softmax function as

p = tf.exp(s) / tf.reshape( tf.reduce_sum(tf.exp(s), axis=1), [-1,1] )

it worked without a problem.
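If you prefer not to reshape, you can get the same row-wise broadcast by keeping the reduced axis (a sketch; depending on your TensorFlow 1.x version the keyword is keep_dims or keepdims):

# keep the summed axis so the [batch, 1] denominator broadcasts across each row
p = tf.exp(s) / tf.reduce_sum(tf.exp(s), axis=1, keep_dims=True)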

Here is a fully working Python 2.7 implementation that uses the hand-crafted softmax with the reshape fix:

# -- imports --
import tensorflow as tf
import numpy as np

# limit numpy print output to 2 digits of precision
np.set_printoptions(precision=2, suppress=True)

# -- constant data --
x = [[0., 0.], [1., 1.], [1., 0.], [0., 1.]]
y_ = [[1., 0.], [1., 0.], [0., 1.], [0., 1.]]

# -- induction --
# 1x2 input -> 2x3 hidden sigmoid -> 3x2 softmax output

# Layer 0 = the x2 inputs
x0 = tf.constant(x, dtype=tf.float32)
y0 = tf.constant(y_, dtype=tf.float32)

# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable(tf.random_uniform([2, 3], minval=0.1, maxval=0.9, dtype=tf.float32))
b1 = tf.Variable(tf.random_uniform([3], minval=0.1, maxval=0.9, dtype=tf.float32))
h1 = tf.sigmoid(tf.matmul(x0, m1) + b1)

# Layer 2 = the 3x2 softmax output
m2 = tf.Variable(tf.random_uniform([3, 2], minval=0.1, maxval=0.9, dtype=tf.float32))
b2 = tf.Variable(tf.random_uniform([2], minval=0.1, maxval=0.9, dtype=tf.float32))
h2 = tf.matmul(h1, m2) + b2
y_out = tf.exp(h2) / tf.reshape( tf.reduce_sum(tf.exp(h2), axis=1) , [-1,1] )


# -- loss --

# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum(tf.square(y0 - y_out))

# training step : gradient descent (learning rate 1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)


# -- training --
# run 500 times using all the X and Y
# print out the loss and any other interesting info
sess = tf.Session()
sess.run(tf.global_variables_initializer())
print "\nloss"
for step in range(500):
    sess.run(train)
    if (step + 1) % 100 == 0:
        print sess.run(loss)

results = sess.run([m1, b1, m2, b2, y_out, loss])
labels = "m1,b1,m2,b2,y_out,loss".split(",")
for label, result in zip(*(labels, results)):
    print ""
    print label
    print result

print ""

Perhaps your initial values for M and b are too large. I re-ran my code above with the weights initialized to large numbers and was able to reproduce your NaN issue.
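For illustration (the numbers below are made up, but anything on this scale has the same effect), an initialization like the following pushes h2 past roughly 88, where tf.exp on a float32 becomes inf, and the NaN shows up again:

# replace the 0.1-0.9 initializers with much larger ones (illustrative values)
m2 = tf.Variable(tf.random_uniform([3, 2], minval=50., maxval=150., dtype=tf.float32))
b2 = tf.Variable(tf.random_uniform([2], minval=50., maxval=150., dtype=tf.float32))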

Upvotes: 1
