Mik3l

Reputation: 27

Gradient Descent optimizer TensorFlow

I'm new to the world of deep learning. These days I'm trying to understand how a neural network works, so I'm running different tests. For now I'm using the MNIST dataset with the digits from 0 to 9. I've applied a fully connected network with no hidden layers. Here is the code:

from keras.datasets import mnist # subroutines for fetching the MNIST dataset
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
from keras.utils import np_utils # utilities for one-hot encoding of ground truth values
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

x_train = mnist.train.images
y_train = mnist.train.labels
x_test = mnist.test.images
y_test = mnist.test.labels

test = np.reshape(x_train,[-1,28,28]) #THRESHOLDING
x_train = np.zeros([55000,28,28])
x_train[test > 0.5] = 1


print(x_train.shape)

x_train = np.reshape(x_train,[55000,784])
# y_train is already one-hot encoded (one_hot=True in read_data_sets), so no extra encoding is needed

print(x_train.shape)
print(y_train.shape)

x_test = np.reshape(x_test,[10000,784])

input = tf.placeholder(tf.float32, name='Input')
output = tf.placeholder(tf.float32, name = 'Output')

syn0 = tf.Variable(2*tf.random_uniform([784,10],seed=1)-1, name= 'syn0')
#syn0 = tf.Variable(tf.zeros([784,10], dtype = tf.float32), name= 'syn0')


b1 = tf.Variable(2*tf.random_uniform([10],seed=1)-1, name= 'b1')
#b1 = tf.Variable(tf.zeros([10],dtype = tf.float32), name= 'b1')

init = tf.global_variables_initializer()

#model

l1 = tf.nn.softmax((tf.matmul(input,syn0) + b1),name='layer1')

error = tf.square(tf.subtract(l1,output),name='error')
loss = tf.reduce_sum(error, name='cost')

#optimizer
with tf.name_scope('trainning'):
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    train = optimizer.minimize(loss)


#session
sess = tf.Session()
sess.run(init)

syn0_ini = sess.run(syn0)

#training
for i in range (10000):
    batch_xs, batch_ys = mnist.train.next_batch(128)
    _,lossNow =  sess.run([train,loss],{input: batch_xs,output: batch_ys})

    if i%10 == 0:
        print("Loss in iteration " , i, " is: ", lossNow )

#print debug 

y_pred = sess.run(l1,{input: x_test,output: y_test})

correct_prediction = tf.equal(tf.argmax(y_pred,1), tf.argmax(y_test,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

print()
print("Final Accuracy: ", sess.run(accuracy))

I've printed the weights (syn0) and I see nothing. But if I initialize them to zero, I can see the shapes of the digits. That's logical, because since there are no hidden layers it works like a correlation.

So in the first case I assume that I can't see anything because the weights were initialized to random values and haven't been modified.

What I don't understand is why only some weights are modified by the training operation, since I'm feeding it a loss that is just one number. In my opinion, all the weights should be modified in the same way.

Here are the weights with random initialization: [image: weights for 0] [image: weights for 1]

Now here are the weights with zero initialization:

[image: weights for 0] [image: weights for 1]

As you can see, some weights remain as they were at the beginning, but some change. How is that possible if the loss function is just a scalar number?

Hope my question is clear. If not, just tell me.

Thank you very much.

Upvotes: 0

Views: 682

Answers (1)

Neb

Reputation: 2280

What I don't understand is why only some weights are modified by the training operation, since I'm feeding it a loss that is just one number. In my opinion, all the weights should be modified in the same way.

This is not completely true.

Consider the linear activation in the case of a single training sample:

Z = W*X + b    # i.e. tf.matmul(input, syn0) + b1

Here you are performing the dot product between W and X. Basically, you're doing:

Z = sum(W[j] * X[j]) + b

Note: the matmul works out because each input example is a [1, 784] row vector and syn0 is a [784, 10] matrix, so each column of syn0 holds the weights of one output unit.
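
To make the shapes concrete, here is a minimal NumPy sketch of that same linear step for a single example (just an illustration, not taken from your code; the names x, W and b are made up):

import numpy as np

x = np.random.rand(784)        # one flattened MNIST image, as a row of features
W = np.random.rand(784, 10)    # plays the role of syn0: one column per output unit
b = np.random.rand(10)         # plays the role of b1

Z = x.dot(W) + b               # shape (10,): Z[k] = sum_j x[j] * W[j, k] + b[k]
print(Z.shape)                 # (10,)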

After that, you apply the non-linear activation function, namely softmax. This gives you the predictions that you use to compute the loss, which, as you said, is a scalar.

Now, when performing the backpropagation step, TF computes the derivative of the loss with respect to each component of W. Explicitly:

dW[j] = dL/dZ * dZ/dW[j]

where:

  • dL/dZ is the derivative of the loss with respect to Z
  • dZ/dW[j] is the derivative of Z with respect to W[j]

The previous formula comes from the chain rule.

It turns out that:

dZ/dW[j] = X[j]

That's why you end up with a different update for each component: the update to W[j] is proportional to the corresponding input X[j], so weights attached to pixels that are always 0 in your images receive a zero gradient and don't change at all.
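
If you want to verify this numerically, here is a small sketch (it only uses the same TF1 API as your code; the toy batch and variable names are made up for illustration) that prints the gradient of the loss with respect to the whole weight matrix. Weights attached to pixels that are 0 in every example of the batch get a zero gradient, so they never move away from their initial value:

import numpy as np
import tensorflow as tf

x_ph = tf.placeholder(tf.float32, [None, 784])
y_ph = tf.placeholder(tf.float32, [None, 10])

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

pred = tf.nn.softmax(tf.matmul(x_ph, W) + b)
loss = tf.reduce_sum(tf.square(pred - y_ph))

grad_W = tf.gradients(loss, W)[0]        # same shape as W: one entry per weight

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_x = np.zeros((4, 784), dtype=np.float32)
    batch_x[:, :100] = 1.0               # only the first 100 pixels are ever "on"
    batch_y = np.eye(10)[[0, 1, 2, 3]].astype(np.float32)
    g = sess.run(grad_W, {x_ph: batch_x, y_ph: batch_y})
    print(np.abs(g[:100]).sum() > 0)     # True: used pixels get non-zero gradients
    print(np.abs(g[100:]).sum())         # 0.0: unused pixels get exactly zero gradient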

For further analysis, see this question. Basically, initializing all the weights of all the neurons to 0 makes your net redundant, since all the neurons end up with the same values for W. However, within each neuron, the components of W will still be different.
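
To illustrate that redundancy with a toy example (a hypothetical two-layer network, not the one in your question): if you add a hidden layer and initialize every weight to 0, then after training all hidden units still have exactly the same incoming weights, so the extra units add nothing:

import numpy as np
import tensorflow as tf

x_ph = tf.placeholder(tf.float32, [None, 784])
y_ph = tf.placeholder(tf.float32, [None, 10])

W1 = tf.Variable(tf.zeros([784, 5]))     # 5 hidden units, all weights start at 0
b1 = tf.Variable(tf.zeros([5]))
W2 = tf.Variable(tf.zeros([5, 10]))
b2 = tf.Variable(tf.zeros([10]))

h = tf.nn.sigmoid(tf.matmul(x_ph, W1) + b1)
pred = tf.nn.softmax(tf.matmul(h, W2) + b2)
loss = tf.reduce_sum(tf.square(pred - y_ph))
train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_x = np.random.rand(32, 784).astype(np.float32)
    batch_y = np.eye(10)[np.random.randint(0, 10, 32)].astype(np.float32)
    for _ in range(5):
        sess.run(train, {x_ph: batch_x, y_ph: batch_y})
    w1 = sess.run(W1)
    # every hidden unit has learned exactly the same incoming weight vector
    print(all(np.allclose(w1[:, 0], w1[:, i]) for i in range(5)))   # True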

Upvotes: 0
