Reputation: 1116
I'm reviewing the material from Andrew Ng's class on ML and trying to implement it in TensorFlow. I was able to use scipy's optimize function to get a cost of 0.213, but with TensorFlow it's stuck at 0.622, not far from the initial loss of 0.693 with an initial set of weights of zero. I reviewed the post here and added a tf.maximum call to my loss function to prevent NaNs. I'm not convinced this is the right approach and I'm sure there is a better way. I also tried using tf.clip_by_value instead (shown after the code below), but that gives the same non-optimized cost.
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

iterations = 1500
alpha = 0.001

with tf.Session() as sess:
    # Placeholders for the data and the number of training examples
    X = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    training_rows = tf.placeholder(tf.float32)
    theta = tf.Variable(tf.zeros([3, 1], dtype=tf.float32))

    # Hypothesis: sigmoid of X * theta
    z = tf.matmul(X, theta)
    h_x = 1.0 / (1.0 + tf.exp(-z))

    # Cross-entropy loss, clamped away from log(0) with tf.maximum
    lhs = tf.matmul(tf.transpose(-y), tf.log(tf.maximum(1e-5, h_x)))
    rhs = tf.matmul(tf.transpose(1 - y), tf.log(tf.maximum(1e-5, 1 - h_x)))
    loss = tf.reduce_sum(lhs - rhs) / training_rows

    optimizer = tf.train.GradientDescentOptimizer(alpha)
    train = optimizer.minimize(loss)

    # Run the session
    X_val, y_val = get_data()
    rows = X_val.shape[0]
    kwargs = {X: X_val, y: y_val, training_rows: rows}

    sess.run(tf.global_variables_initializer())
    # Redundant (theta is already initialized to zeros), but kept for clarity
    sess.run(tf.assign(theta, np.zeros((3, 1), dtype=np.float32)))

    print("Original cost before optimization is: {}".format(sess.run(loss, kwargs)))
    print("Optimizing loss function")

    costs = []
    for i in range(iterations):
        sess.run(train, kwargs)
        costs.append(sess.run(loss, kwargs))

    optimal_theta, final_loss = sess.run([theta, loss], kwargs)
    print("Optimal value for theta is: {} with a loss of: {}".format(optimal_theta, final_loss))

    plt.plot(costs)
    plt.show()
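For reference, the tf.clip_by_value variant mentioned above would just swap the tf.maximum calls for something like:

lhs = tf.matmul(tf.transpose(-y), tf.log(tf.clip_by_value(h_x, 1e-5, 1.0)))
rhs = tf.matmul(tf.transpose(1 - y), tf.log(tf.clip_by_value(1 - h_x, 1e-5, 1.0)))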
I also noticed that any learning rate greater than 0.001 would cause the loss to dance wildly back and forth. Is that normal? Finally, when I increased the iterations to 25,000, the cost went down to 0.53. I was expecting it to converge in far fewer iterations.
Upvotes: 1
Views: 393
Reputation: 1116
I learned a lot trying to figure this out. For starters, I didn't realize that this part of the loss function could potentially be problematic:
loss = -y * log(h(x)) - (1 - y) * log(1 - h(x))
If h(x), the sigmoid function, comes out to exactly 1 (and that can happen if z, i.e. X * theta, gets large), then we will be evaluating log(1 - 1) = log(0), which is negative infinity.
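Here's a quick way to see that saturation happen (a minimal sketch in plain NumPy, separate from the TensorFlow code above):

import numpy as np

z = np.float32(40.0)            # unscaled features easily produce a z this large
h = 1.0 / (1.0 + np.exp(-z))    # sigmoid saturates to exactly 1.0 in float32
print(h)                        # 1.0
print(np.log(1.0 - h))          # -inf, which then poisons the loss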
To fix this problem, I had to use feature scaling to normalize the values I had for X. This keeps X * theta, and therefore z, small, so the sigmoid does not saturate at 1. As z gets large, e^-z tends towards zero, so keeping z relatively small ensures that e^-z has an actual value that can be added to the 1 in the denominator of:
h(x) = 1 / (1 + e^-z),  where z = X * theta
And for reference, Feature Scaling just means subtracting the mean and dividing by the range.
(arr - mean) / (max - min)
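For example, a minimal sketch of that scaling applied column-wise with NumPy (X_raw here is just a stand-in for whatever get_data() returns; the column of ones for the intercept is added after scaling so its zero range doesn't cause a division by zero):

import numpy as np

def scale_features(arr):
    # (arr - mean) / (max - min), computed per column
    return (arr - arr.mean(axis=0)) / (arr.max(axis=0) - arr.min(axis=0))

X_raw = np.random.rand(100, 2).astype(np.float32) * 100   # stand-in for the real features
X_scaled = np.hstack([np.ones((X_raw.shape[0], 1), dtype=np.float32),
                      scale_features(X_raw)])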
Upvotes: 1