Reputation: 3650
I built a small conv net in TensorFlow. What I noticed is that if I add dropout to the fully connected layer, I have to use lower learning rates or else the gradients overshoot. Is there any explanation for why this keeps happening?
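For reference, roughly the kind of setup I mean (a minimal sketch, not my actual code; the layer sizes, the 0.5 dropout rate and the learning rates are just placeholders):

```python
import tensorflow as tf

def build_model(dropout_rate=0.5):
    # Small conv net with dropout on the fully connected layer.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),   # dropout on the FC layer
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# Without dropout a learning rate of 1e-2 trains fine; with dropout I have to
# lower it (e.g. to 1e-3) or the loss blows up.
model = build_model(dropout_rate=0.5)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy")
```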
Upvotes: 1
Views: 1029
Reputation: 7148
Funnily enough, the opposite has been observed in the literature. The original dropout paper is here: http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf. In Appendix A.2, the authors explain that the learning rate should be increased by a factor of 10-100 and the momentum should be raised as well, because many gradients cancel each other out. Maybe you are not using a large enough batch size.
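For illustration, a minimal sketch of what that recommendation could look like in Keras; the concrete numbers (base learning rate, momentum, batch size) are assumptions for the example, not values taken from your setup:

```python
import tensorflow as tf

base_lr = 0.01  # assumed learning rate that works for the same net without dropout

# Appendix A.2 recommendation: with dropout, use a 10-100x larger learning rate
# and a higher momentum, since many of the noisy gradients cancel each other out.
optimizer_with_dropout = tf.keras.optimizers.SGD(
    learning_rate=10 * base_lr,   # 10-100x the usual rate
    momentum=0.95,                # raised above the typical 0.9
)

# A larger batch also helps average out the gradient noise introduced by dropout:
# model.compile(optimizer=optimizer_with_dropout, loss=...)
# model.fit(x_train, y_train, batch_size=256, ...)
```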
The following is my explanation, in contrast to the literature cited above, for why you observed this result.
With 0.5 dropout, only half of the neurons are active and contribute to the output, yet the error stays roughly the same size. That error is backpropagated through only half the neurons, so each active neuron's "share" of the error doubles.
With the same learning rate, the gradient updates therefore roughly double as well, and you end up with the same problem as if you had used a larger learning rate in the first place. By lowering the learning rate, the updates fall back into the range you were using before.
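A small toy check of this reasoning (everything here is an illustrative assumption: random data, a quadratic loss, TensorFlow's inverted dropout):

```python
import tensorflow as tf

tf.random.set_seed(0)
x = tf.random.normal([512, 100])                  # activations feeding the FC layer
w = tf.Variable(tf.random.normal([100, 1]) * 0.1)
dropout = tf.keras.layers.Dropout(0.5)

def grad_norm(training):
    """Gradient norm of a toy quadratic loss, with dropout on or off."""
    with tf.GradientTape() as tape:
        h = dropout(x, training=training)         # inverted dropout: survivors scaled by 1/0.5
        loss = tf.reduce_mean(tf.square(h @ w))
    return tf.norm(tape.gradient(loss, w)).numpy()

print("no dropout :", grad_norm(False))
print("dropout 0.5:", grad_norm(True))            # roughly twice as large in this toy setup

# One way to compensate: scale the learning rate by the keep probability.
keep_prob = 0.5
lr_no_dropout = 0.01                              # assumed learning rate that worked before
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_no_dropout * keep_prob)
```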
Upvotes: 2