Peter HIRT

Reputation: 741

Problems with initialisation of weight and biases?

I use a CNN to do classification (6 classes) of 32x32 pixel images.

The network is behaving very strangely: it takes 900 steps before it does anything different from the results pasted below, and only after several thousand steps does it start to move towards reasonable values. The "same network" built in Theano/Keras with the same data shows results much faster. I say "same network" because Theano/Keras lets me define a LeakyReLU activation and other initialisation methods for the variables (init = "orthogonal"), which leads me to believe my problem is with initialisation. But I could be far off ;-)

    step        ->  900
    Minibatch loss at step 900: 1.440395
    Minibatch accuracy: 43.0%
    Validation accuracy: 35.5%
    Val F-score: 0.0872979214781    
    confusion matrix 
 [[  0   0   0   0   0   0]
 [  0   0   0   0   0   0]
 [  0   0   0   0   0   0]
 [  0   0   0   0   0   0] 
 [ 21 292 224 398 567  96]
 [  0   0   0   0   0   0]]

my optimizer is

tf.train.GradientDescentOptimizer(0.1).minimize(loss)

my weights and bias initialisation is (only layer 1 of 6 is shown; the others are identical)

wc1 = tf.Variable(tf.truncated_normal([2, 2, 1, 16], stddev=0.05), name="weights_1")
bc1 = tf.Variable(tf.zeros([16]), name="bias_1")

can anybody give me a hint towards:

  1. why it takes 900 steps for the network to "do something"
  2. why the confusion matrix gives this uniform prediction at the beginning

thanks a lot Peter

Upvotes: 1

Views: 928

Answers (2)

thibaultbl

Reputation: 984

If you use the ReLU activation function, look at "Kaiming initialisation" (He initialisation) for your weights. The objective is to keep a mean of 0 and a standard deviation of 1 for the output of each layer during the forward pass.

For the ReLU activation function you have to initialise with a random normal distribution multiplied by the square root of 2/(number of inputs for the given layer):

weight_initialisation = random_normal * sqrt(2 / (number of inputs for the layer))

For a CNN, I think the number of inputs will be the number of filters multiplied by the number of cells in the kernel (e.g. 5 * 5 for a [5, 5] kernel).
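As a minimal sketch of that idea, using the 2x2x1x16 layer shape from the question (the fan-in computation here is my reading of the rule above, not something from the original post):

    import numpy as np
    import tensorflow as tf

    # Kaiming/He-style initialisation for the first conv layer.
    # Assumed fan-in: kernel height * kernel width * input channels.
    fan_in = 2 * 2 * 1
    stddev = np.sqrt(2.0 / fan_in)  # keeps ReLU output variance roughly constant

    wc1 = tf.Variable(tf.truncated_normal([2, 2, 1, 16], stddev=stddev), name="weights_1")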

Upvotes: 1

Saad Khan

Reputation: 326

As long as you're using standard ReLU units, initialising the bias to 0 is a bad idea because those neurons can die easily: the ReLU gets into a regime where it outputs 0 for every input while also having 0 gradient. It can then no longer be trained, and it affects downstream neurons by always outputting 0. The first thing I would try is to initialise with a higher bias. Another option is to use a leaky-ReLU analogue such as ELU, which is available in TensorFlow.
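A rough sketch of both suggestions, reusing the layer shape from the question (the 0.1 bias value and the input placeholder are illustrative assumptions, not from the original post):

    import tensorflow as tf

    # Small positive bias (0.1 is an illustrative value) so ReLU units start in the active region.
    bc1 = tf.Variable(tf.constant(0.1, shape=[16]), name="bias_1")

    # ELU keeps a non-zero gradient for negative inputs, unlike a standard ReLU.
    x = tf.placeholder(tf.float32, [None, 32, 32, 1])  # assumed 32x32 single-channel input
    wc1 = tf.Variable(tf.truncated_normal([2, 2, 1, 16], stddev=0.05), name="weights_1")
    conv1 = tf.nn.conv2d(x, wc1, strides=[1, 1, 1, 1], padding="SAME")
    act1 = tf.nn.elu(conv1 + bc1)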

Also, what do the columns and rows in your confusion matrix mean? Based on that output, it either means that every single example has the same label (in which case you should check your labels) or that every single example is getting the same prediction from the network (in which case you should worry about dying neurons).

You could also try lowering your learning rate; it might be too high, leading to instabilities. Lastly, if the problem is related to initialization, you could take the output after 900 steps and use that as initialization. I would only try this after making sure that neuron death is taken care of.
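For the learning-rate suggestion, the change to the optimizer from the question would just be a smaller value (0.01 here is an arbitrary example, and `loss` is your existing loss tensor):

    # Same optimizer as in the question, with a lower (illustrative) learning rate.
    optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)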

Upvotes: 1
