Reputation: 539
I've seen in multiple places that you should disable dropout during validation and testing stages and only keep it during the training phase. Is there a reason why that should happen? I haven't been able to find a good reason for that and was just wondering.
One reason I'm asking is because I trained a model with dropout, and the results turned out well - about 80% accuracy. Then, I went on to validate the model but forgot to set the prob to 1 and the model's accuracy went down to about 70%. Is it supposed to be that drastic? And is it as simple as setting the prob to 1 in each dropout layer?
Thanks in advance!
Upvotes: 29
Views: 24803
Reputation: 7130
There is a Bayesian technique called Monte Carlo dropout in which the dropout would be not disabled during testing. The model will run several times with the same dropout rate(or in one go as a batch), and the mean(line 6 depicted below) and variance(line 7 depicted below) of the results will be calculated to determine the uncertainty.
Here is Uber's application to quantify uncertainty:
Upvotes: 5
Reputation: 7130
Dropout is a method of making bagging practical for ensembles of very many large neural networks.
Along the same line we may remember that using the following false explanation: For the new data, we can predict their classes by taking the average of the results from all the learners:
Since N is a constant we can just ignore it and the result remains the same, so we should disable dropout during validation and testing.
The true reason is much more complex. It is because of the weight scaling inference rule:
We can approximate p_{ensemble} by evaluating p(y|x) in one model: the model with all units, but with the weights going out of unit i multiplied by the probability of including unit i. The motivation for this modification is to capture the right expected value of the output from that unit. There is not yet any theoretical argument for the accuracy of this approximate inference rule in deep nonlinear networks, but empirically it performs very well.
When we train the model using dropout(for example for one layer) we zero out some outputs of some neurons and scale the others up by 1/keep_prob to keep the expectation of the layer almost the same as before. In the prediction process, we can use dropout but we can only get different predictions each time because we drop the values out randomly, then we need to run the prediction many times to get the expected output. Such a process is time-consuming so we can remove the dropout and the expectation of the layer remains the same.
Reference:
Upvotes: 10
Reputation: 896
Simplest reason can be, during prediction(test, validation or after production deployment) you want to use the capability of each and every learned neurons and really don't like to skip some of them randomly.
Thats the only reason we set probability as 1 during testing.
Upvotes: 5
Reputation: 31
Short answer: Dropouts to bring down over fitting in the training data. They are used as a regularization parameters. So if you have high variance (i.e. look at the difference between training set and validation set accuracy for this) then use drop out on training data, as it won't be good enough to apply dropout on test and validation data as you haven't been sure about the neurons which are going to shut off hence laying off the importance of random neurons which can be important.
Upvotes: 1
Reputation: 6749
Dropout is a random process of disabling neurons in a layer with chance p. This will make certain neurons feel they are 'wrong' in each iteration - basically, you are making neurons feel 'wrong' about their output so that they rely less on the outputs of the nodes in the previous layer. This is a method of regularization and reduces overfitting.
However, there are two main reasons you should not use dropout to test data:
However, you might want to read some more on what validation/testing exactly is:
Training set: a set of examples used for learning: to fit the parameters of the classifier In the MLP case, we would use the training set to find the “optimal” weights with the back-prop rule
Validation set: a set of examples used to tune the parameters of a classifier In the MLP case, we would use the validation set to find the “optimal” number of hidden units or determine a stopping point for the back-propagation algorithm
Test set: a set of examples used only to assess the performance of a fully-trained classifier In the MLP case, we would use the test to estimate the error rate after we have chosen the final model (MLP size and actual weights) After assessing the final model on the test set, YOU MUST NOT tune the model any further!
Why separate test and validation sets? The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model After assessing the final model on the test set, YOU MUST NOT tune the model any further!
source : Introduction to Pattern Analysis,Ricardo Gutierrez-OsunaTexas A&M University, Texas A&M University (answer)
So even for validation, how would you determine which nodes you remove if the nodes have a random probability of being disactivated?
Upvotes: 29