JMS1

Reputation: 27

Neural network: stddev of weights as function of layer size. Why?

Quick question about neural networks. I understand why weights are initialized with small random values: it breaks the tie between weights so that they have non-zero loss gradients. I was under the impression that it didn't matter much what the small random value was, as long as the tie is broken. Then I read this:

import math
import tensorflow as tf

weights = tf.Variable(
    tf.truncated_normal([hidden1_units, hidden2_units],
                        stddev=1.0 / math.sqrt(float(hidden1_units))),
    name='weights')

Rather than assigning some small constant stddev like 0.1, the designer goes to the effort of setting it to 1/sqrt of the number of nodes in the layer below:

stddev=1.0 / math.sqrt(float(hidden1_units))
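
Just to see the scale, here is a quick sketch of what that formula gives for a few example layer widths (the widths below are made up, not from the tutorial):

import math

for hidden1_units in (64, 256, 1024):   # hypothetical layer widths
    print(hidden1_units, 1.0 / math.sqrt(float(hidden1_units)))
# 64   -> 0.125
# 256  -> 0.0625
# 1024 -> 0.03125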

Why would they do that?

Is this more stable? Does it avoid some unwanted behavior? Does it train faster? Should I implement this practice in my own NNs?

Upvotes: 0

Views: 354

Answers (1)

Abhishek

Reputation: 3417

First of all, always remember that the aim of this initialization (and of training) is to make sure the neurons, and hence the network, learn something meaningful.

Now assume you are using a sigmoid activation function

[Figure: the sigmoid function, steep near x = 0 and nearly flat at the extremes]

As you can see above, the biggest change in y for a given change in x is near the center; at the extremes the change is very small, and so is the gradient during back-propagation.
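
A quick numerical illustration of that (just a sketch with NumPy; the sample points are arbitrary):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the sigmoid

# Gradient near the center vs. at the extremes.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid_grad(x))
# 0.0  -> 0.25
# 2.0  -> ~0.105
# 5.0  -> ~0.0066
# 10.0 -> ~0.000045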

Now, wouldn't it be great if we could somehow ensure that the input to the activation lands in the good region of the sigmoid?

So the aim for the input to a neuron (with a sigmoid activation) is:
Mean: zero
Variance: small (and independent of the number of input dimensions)

Assuming the dimension of the input layer is n:

input-to-activation = ∑_{i=1}^{n} w_i x_i
out-of-neuron = sigmoid(input-to-activation)

Now assume that the w_i and the x_i are each independent, and that we have normalized the inputs x_i to N(0, 1).

So, as of now

X : mean 0 and std 1
W : uniform(-1/sqrt(n), 1/sqrt(n)), by assumption, so
    mean(W) = 0 and var(W) = (2/sqrt(n))^2 / 12 = (4/n)/12 = 1/(3n)
    [the variance of uniform(a, b) is (b - a)^2 / 12]
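
A quick empirical check of that variance (a sketch; the n and the sample count are arbitrary):

import numpy as np

n = 100
w = np.random.uniform(-1/np.sqrt(n), 1/np.sqrt(n), size=1_000_000)
print(w.var())        # ≈ 1 / (3 * n) = 0.00333...
print(1 / (3 * n))    # 0.00333...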

Assuming X and Y are independent and have zero mean:

Var(X + Y) = Var(X) + Var(Y), used for the summation over all the w_i x_i terms,
and
Var(X * Y) = Var(X) * Var(Y), used for each product w_i x_i.

Now check:

mean of input-to-activation: 0
variance of input-to-activation: n * (1/(3n)) * 1 = 1/3

So now we are in the good zone for a sigmoid activation, meaning not at the extreme ends. Also check that the variance is independent of the number of inputs, n.
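
We can also check the whole thing empirically. A minimal Monte Carlo sketch (the layer sizes and sample count are arbitrary):

import numpy as np

def pre_activation_variance(n, samples=10_000):
    # x ~ N(0, 1), w ~ uniform(-1/sqrt(n), 1/sqrt(n)), all i.i.d.
    x = np.random.randn(samples, n)
    w = np.random.uniform(-1/np.sqrt(n), 1/np.sqrt(n), size=(samples, n))
    pre_act = (w * x).sum(axis=1)    # ∑ w_i x_i for each sample
    return pre_act.var()

for n in (10, 100, 1000):
    print(n, pre_activation_variance(n))
# Each result is close to 1/3, regardless of n.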

Beautiful, isn't it?

But this is not the only way to initialize. Glorot and Bengio also proposed an initialization scheme that considers both the input and the output dimensions of the weight matrix. Read this for further details on both.
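
For reference, a minimal sketch of that second scheme, usually called Xavier/Glorot initialization, which uses both the fan-in and the fan-out of the weight matrix (the layer sizes below are just examples):

import numpy as np

def glorot_uniform(fan_in, fan_out):
    # Glorot & Bengio (2010): keep the variance of activations and gradients
    # roughly constant by using Var(W) = 2 / (fan_in + fan_out),
    # i.e. uniform on (-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(784, 256)   # example layer sizes
print(W.std())                 # ≈ sqrt(2 / (784 + 256)) ≈ 0.044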

Upvotes: 2
