Shubhashis

Reputation: 10631

Why input is scaled in tf.nn.dropout in tensorflow?

I can't understand why dropout works like this in TensorFlow. The CS231n course notes say that "dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise." You can also see this in the figure from the same site.

From the TensorFlow documentation: "With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0."

Now, why is the input element scaled up by 1/keep_prob? Why not just keep the input element as it is with probability keep_prob, without scaling it by 1/keep_prob?
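
To make the two alternatives concrete, here is a small NumPy sketch (my own illustration, not TensorFlow's implementation) of keeping elements as-is versus scaling the kept ones by 1/keep_prob:

import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5
x = np.full(8, 3.0)                      # some input elements

mask = rng.random(x.shape) < keep_prob   # keep each element with probability keep_prob
plain = x * mask                         # just keep or zero: kept elements stay 3.0
scaled = x * mask / keep_prob            # what tf.nn.dropout outputs: kept elements become 6.0

print(plain)                             # e.g. [3. 0. 0. 3. ...]
print(scaled)                            # e.g. [6. 0. 0. 6. ...]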

Upvotes: 44

Views: 15738

Answers (4)

Tommaso Di Noto

Reputation: 1362

If you keep reading the cs231n notes, the difference between dropout and inverted dropout is explained.

In a network with no dropout, the activations in layer L will be aL. The weights of the next layer (L+1) are learned to receive aL and produce the output accordingly. But in a network with dropout (keep_prob = p), the weights of L+1 are learned to receive p*aL instead. Why p*aL? Because the expected value is E(aL) = probability_of_keeping * aL + probability_of_not_keeping * 0 = p*aL + (1-p)*0 = p*aL.

In the same network there is no dropout at test time, so layer L+1 simply receives aL, but its weights were trained to expect p*aL as input. Therefore, at test time you would have to multiply the activations by p. Instead of doing this, you can multiply the activations by 1/p during training only. This is called inverted dropout.

Since we want to leave the forward pass at test time untouched (and tweak our network just during training), tf.nn.dropout directly implements inverted dropout, scaling the values.
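
Here is a short NumPy sketch of that bookkeeping (my own illustration; the names aL, p and the non-zero mean are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                    # keep probability
aL = rng.normal(loc=2.0, size=100_000)     # activations of layer L (non-zero mean for illustration)
mask = rng.random(aL.shape) < p            # keep each unit with probability p

# Vanilla dropout: train on the masked activations, multiply by p at test time
train_vanilla = aL * mask                  # expected value p * aL
test_vanilla = aL * p                      # compensate at test time

# Inverted dropout (what tf.nn.dropout implements): scale by 1/p during training
train_inverted = aL * mask / p             # expected value aL
test_inverted = aL                         # test-time forward pass untouched

print(train_vanilla.mean(), test_vanilla.mean())    # both close to p * mean(aL) = 1.6
print(train_inverted.mean(), test_inverted.mean())  # both close to mean(aL) = 2.0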

Upvotes: 0

Alaroff

Reputation: 2298

Here is a quick experiment to dispel any remaining confusion.

Statistically, the weights of an NN layer follow a distribution that is usually close to normal (but not necessarily), and even when trying to sample a perfect normal distribution, in practice there are always sampling errors.

Then consider the following experiment:

import numpy as np
from collections import defaultdict

DIM = 1_000_000                      # set our dims for weights and input
x = np.ones((DIM,1))                 # our input vector
#x = np.random.rand(DIM,1)*2-1.0     # or could also be a more realistic normalized input
print("x-mean = ", x.mean())

probs = [1.0, 0.7, 0.5, 0.3]         # define dropout (keep) probs

W = np.random.normal(size=(DIM,1))   # sample normally distributed weights
print("W-mean = ", W.mean())         # note the mean is not perfect --> sampling error!

# DO THE DRILL
h = defaultdict(list)
for i in range(1000):
  for p in probs:
    M = np.random.rand(DIM,1)
    M = (M < p).astype(int)          # keep mask: 1 with probability p, 0 otherwise
    Wp = W * M                       # "drop out" weights by zeroing them
    a = np.dot(Wp.T, x)              # linear activation
    h[str(p)].append(a)

for k,v in h.items():
  print("For drop-out prob %r the average linear activation is %r (unscaled) and %r (scaled)" % (k, np.mean(v), np.mean(v)/float(k)))

Sample output:

x-mean =  1.0
W-mean =  -0.001003985674840264
For drop-out prob '1.0' the average linear activation is -1003.985674840258 (unscaled) and -1003.985674840258 (scaled)
For drop-out prob '0.7' the average linear activation is -700.6128015029908 (unscaled) and -1000.8754307185584 (scaled)
For drop-out prob '0.5' the average linear activation is -512.1602655283492 (unscaled) and -1024.3205310566984 (scaled)
For drop-out prob '0.3' the average linear activation is -303.21194422742315 (unscaled) and -1010.7064807580772 (scaled)

Notice that the unscaled activations shrink roughly in proportion to the keep probability, while the scaled values stay close to the no-dropout activation, which is non-zero in the first place only because of the sampling error in W.

Can you spot an obvious correlation between the W-mean and the average linear activation means?

Upvotes: 0

Trideep Rath

Reputation: 3703

Let's say the network has n neurons and we apply a dropout rate of 1/2.

Training phase: we are left with n/2 active neurons on average. So if you were expecting output x with all the neurons, you now get roughly x/2, and for every batch the network weights are trained against this x/2.

Testing/inference/validation phase: we don't apply any dropout, so all neurons are active and the output is x rather than the x/2 the weights were trained for, which would give you an incorrect result. One fix is to scale the output down to x/2 during testing.

Rather than applying that scaling only in the testing phase, what TensorFlow's dropout does is scale the kept activations up by 1/keep_prob during training, so that the expected sum of activations stays constant whether dropout is applied or not (training or testing).
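
A quick eager-mode check of that claim (note that current TensorFlow versions take rate = 1 - keep_prob instead of keep_prob):

import numpy as np
import tensorflow as tf

x = tf.ones([1000])                               # sum of activations = 1000.0
# keep_prob = 0.5 corresponds to rate = 0.5 here
sums = [float(tf.reduce_sum(tf.nn.dropout(x, rate=0.5))) for _ in range(200)]
print(np.mean(sums))                              # ~1000: the 1/keep_prob scaling keeps the expected sum constant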

Upvotes: 5

mrry

Reputation: 126154

This scaling enables the same network to be used for training (with keep_prob < 1.0) and evaluation (with keep_prob == 1.0). From the Dropout paper:

The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.

Rather than adding ops to scale down the weights by keep_prob at test time, the TensorFlow implementation adds an op that scales up the retained values by 1. / keep_prob at training time. The effect on performance is negligible, and the code is simpler (because we use the same graph and treat keep_prob as a tf.placeholder() that is fed a different value depending on whether we are training or evaluating the network).
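
A minimal sketch of that setup in TF 1.x style (written against tf.compat.v1 so it still runs on TF 2; the shapes and the now-deprecated keep_prob argument are just for illustration):

import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.placeholder(tf.float32, [None, 4])
keep_prob = tf.placeholder(tf.float32)              # same graph, different fed value
y = tf.nn.dropout(x, keep_prob=keep_prob)           # kept elements scaled by 1/keep_prob

batch = np.ones((1, 4), dtype=np.float32)
with tf.Session() as sess:
    print(sess.run(y, {x: batch, keep_prob: 0.5}))  # training: e.g. [[2. 0. 2. 2.]]
    print(sess.run(y, {x: batch, keep_prob: 1.0}))  # evaluation: [[1. 1. 1. 1.]], dropout is a no-op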

Upvotes: 53
