Reputation: 115
I am trying to implement a neural network with dropout in TensorFlow:
tf.layers.dropout(inputs, rate, training)
From the documentation: "Dropout consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting. The units that are kept are scaled by 1 / (1 - rate), so that their sum is unchanged at training time and inference time."
Now I understand this behavior if dropout is applied on top of sigmoid activations, which are strictly above zero. If half of the input units are zeroed, the sum of all the outputs will also be halved, so it makes sense to scale them by a factor of 2 in order to regain some kind of consistency before the next layer.
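For example, here is a quick NumPy sanity check of what I mean (just my own sketch of the math, not what TensorFlow actually does internally; the names are mine):

    import numpy as np

    rng = np.random.default_rng(0)

    # sigmoid outputs: strictly positive inputs to the dropout layer
    x = 1.0 / (1.0 + np.exp(-rng.normal(size=1000)))
    rate = 0.5

    # inverted dropout: zero a fraction `rate`, scale survivors by 1 / (1 - rate)
    mask = rng.random(x.shape) >= rate
    dropped = np.where(mask, x / (1.0 - rate), 0.0)

    print(x.sum())        # sum of the original activations
    print(dropped.sum())  # roughly the same on average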
Now what if one uses the tanh activation, which is centered around zero? The reasoning above no longer holds, so is it still valid to scale the output of dropout by the mentioned factor? Is there a way to prevent TensorFlow dropout from scaling the outputs?
Thanks in advance
Upvotes: 2
Views: 1329
Reputation: 142
If you have a set of inputs to a node and a set of weights, their weighted sum is a value, S. You can define another random variable by selecting a random fraction f of the original inputs. The weighted sum, using the same weights, of the random variable defined this way has expected value S * f, so rescaling by 1 / f restores the expected sum. From this you can see the argument for rescaling is exact if the objective is for the mean of the sum to remain the same with and without dropout. This carries over to the node's output exactly when the activation function is linear in the range of the weighted sums of subsets, and approximately when the activation function is approximately linear in that range.
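As a rough illustration (a NumPy sketch of the expectation argument only, not of TensorFlow's implementation; all names are made up), the sign of the inputs does not matter for this part:

    import numpy as np

    rng = np.random.default_rng(0)

    # arbitrary signed inputs (e.g. tanh outputs) and weights
    inputs = np.tanh(rng.normal(size=100))
    weights = rng.normal(size=100)
    S = np.dot(weights, inputs)              # full weighted sum

    f = 0.5                                  # fraction of units kept
    masks = rng.random((100000, 100)) < f    # many random subsets
    subset_sums = (masks * inputs) @ weights # expected value is S * f

    print(S)
    print(subset_sums.mean() / f)            # rescaling by 1 / f recovers S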
After passing the linear combination through any non-linear activation function, it is no longer true that rescaling exactly preserves the expected mean. However, if the contribution to a node is not dominated by a small number of nodes, the variance in the sum of a randomly selected subset of a chosen, fairly large size will be relatively small, and if the activation function is approximately linear near the full weighted sum, rescaling will work well to produce an output with approximately the same mean. E.g. the logistic and tanh functions are approximately linear over any small region. Note that the range of the function is irrelevant; only the differences between its values matter.
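Continuing the sketch above (same caveats: plain NumPy, made-up numbers), when each unit makes a small contribution, the rescaled sums passed through tanh still give approximately the right mean, whereas the unscaled sums do not:

    import numpy as np

    rng = np.random.default_rng(1)

    n = 200
    pre = np.tanh(rng.normal(loc=1.0, size=n))   # tanh outputs feeding one node
    weights = rng.uniform(0.0, 2.0, size=n) / n  # many small contributions
    S = np.dot(weights, pre)

    f = 0.5
    masks = rng.random((100000, n)) < f
    kept_sums = (masks * pre) @ weights          # no rescaling
    rescaled_sums = kept_sums / f                # inverted-dropout rescaling

    print(np.tanh(S))                    # next activation without dropout
    print(np.tanh(rescaled_sums).mean()) # close: tanh is ~linear near S
    print(np.tanh(kept_sums).mean())     # noticeably off without rescaling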
With relu activation, if the original weighted sum is close enough to zero that the weighted sums of subsets fall on both sides of zero (a non-differentiable point of the activation function), rescaling won't work so well. But this is a relatively rare situation, limited to outputs that are small, so it may not be a big problem.
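A small sketch of that relu corner case (again just NumPy with made-up numbers, not anything TensorFlow-specific):

    import numpy as np

    rng = np.random.default_rng(2)

    n = 200
    pre = rng.normal(size=n)
    weights = rng.normal(size=n) / n
    S = np.dot(weights, pre)                 # a weighted sum sitting close to zero

    f = 0.5
    masks = rng.random((100000, n)) < f
    rescaled = (masks * pre) @ weights / f   # subsets land on both sides of 0

    relu = lambda z: np.maximum(z, 0.0)
    print(relu(S))                # relu of the full sum
    print(relu(rescaled).mean())  # biased upward, but the absolute error is small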
The main observations here are that rescaling works best with large numbers of nodes making significant contributions, and relies on local approximate linearity of activation functions.
Upvotes: 1
Reputation: 414
The point of setting a node's output to zero is that the neuron then has no effect on the neurons it feeds into. This creates sparsity and hence helps reduce overfitting. Whether you use sigmoid or tanh, the dropped value is still set to zero.
I think your line of reasoning here is incorrect. Think of contribution rather than sum.
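For instance (a tiny NumPy sketch, names of my own choosing), a dropped unit contributes exactly zero to the downstream weighted sum regardless of whether its activation came from a sigmoid or a tanh:

    import numpy as np

    acts = np.tanh(np.array([-2.0, -0.5, 0.3, 1.5]))  # tanh outputs, signed
    w = np.array([0.4, -1.2, 0.7, 0.1])               # downstream weights
    mask = np.array([1.0, 0.0, 1.0, 0.0])             # second and fourth units dropped

    # a dropped unit's contribution w_i * 0 is exactly zero, whatever its sign
    print(np.dot(w, acts))         # full contribution
    print(np.dot(w, acts * mask))  # contributions only from surviving units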
Upvotes: 0