kiku06

Reputation: 1

Keras minibatch gradient descent with Dropout layer

I have a question about the Dropout implementation in Keras/TensorFlow with mini-batch gradient descent optimization when the batch_size parameter is greater than one. The original paper says:

The only difference is that for each training case in a mini-batch, we sample a thinned network by dropping out units. Forward and backpropagation for that training case are done only on this thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.

But how is it implemented in Keras? As I understand it, for each sample in a batch an individual gradient is calculated against the current model (as different units are dropped for different samples). Then, after all samples from the batch are processed, the respective gradients for each weight are summed, these sums are divided by batch_size, and the result is applied to the respective weights.
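For concreteness, this is roughly how I imagine it would have to look if it were really done per training case (just a sketch using tf.GradientTape; model, loss_fn and the variable names are placeholders, not Keras internals):

import tensorflow as tf

# Hypothetical sketch of per-sample gradient averaging; `model` and
# `loss_fn` are placeholders, not actual Keras internals.
def manual_minibatch_gradients(model, loss_fn, x_batch, y_batch):
    per_sample_grads = []
    for i in range(x_batch.shape[0]):
        x_i = x_batch[i:i + 1]   # a single training case
        y_i = y_batch[i:i + 1]
        with tf.GradientTape() as tape:
            # Dropout with training=True would sample a fresh mask here,
            # i.e. a different "thinned network" for this training case.
            y_pred = model(x_i, training=True)
            loss = loss_fn(y_i, y_pred)
        per_sample_grads.append(tape.gradient(loss, model.trainable_variables))
    # Average the per-sample gradients over the mini-batch.
    return [tf.add_n(grads) / x_batch.shape[0] for grads in zip(*per_sample_grads)]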

Going through the source code I cannot see if and where this is handled. In the function _process_single_batch, the overall/averaged batch loss is computed and the batch gradient is calculated from it. This works fine for models without a Dropout layer, but with a Dropout layer, how are the individual model configurations for each sample (with different neurons dropped) remembered and then taken into account during the gradient descent calculation?

I think I am missing something, and I want to be sure that I correctly understand the Keras implementation of mini-batch gradient descent when a Dropout layer is involved.

Upvotes: 0

Views: 483

Answers (2)

Dr. Snoopy

Reputation: 56407

What you describe from the paper is a theoretical interpretation of how Dropout can be implemented. It is not really implemented like that in any framework.

Dropout is implemented as a layer that, during training, samples a drop mask from a Bernoulli distribution with a given probability. This mask contains 0's and 1's, where a 0 means that this particular neuron was dropped.

Then automatic differentiation is used to compute the gradient through the Dropout layer, which just means multiplying the incoming gradient component-wise by the drop mask, cancelling the gradients of the dropped neurons.

As you mention, the drop mask is key to obtaining the appropriate behavior. The forward pass and the gradients are computed together, and a different drop mask is sampled for each sample in the batch, which means this works without any additional support from the framework.
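A tiny sketch of that idea (hypothetical shapes and values, using plain TF ops rather than the Keras layer): the mask is sampled independently per element of the batch, and differentiating through the element-wise multiply gives a zero gradient wherever the mask is zero.

import tensorflow as tf

tf.random.set_seed(0)
x = tf.random.normal([4, 3])   # batch of 4 samples, 3 units each

with tf.GradientTape() as tape:
    tape.watch(x)
    # Bernoulli drop mask, sampled independently per sample and per unit.
    mask = tf.cast(tf.random.uniform(tf.shape(x)) >= 0.5, x.dtype)
    y = x * mask * 2.0           # scale kept units by 1/keep_prob = 2.0
    loss = tf.reduce_sum(y)

grad = tape.gradient(loss, x)
# grad is 2.0 where a unit was kept and exactly 0.0 where it was dropped,
# so a dropped unit contributes a zero gradient for that training case.
print(mask.numpy())
print(grad.numpy())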

Implementing the full idea of dropping neurons would be much more complicated.

Upvotes: 0

Vladimir Sotnikov

Reputation: 1489

TensorFlow does not really "drop out" neurons from the model, it just multiplies their outputs by zero. Let's take a look at the implementation of dropout:

def dropout_v2(x, rate, noise_shape=None, seed=None, name=None):
    <...>
    # Sample a uniform distribution on [0.0, 1.0) and select values larger than
    # rate.
    #
    # NOTE: Random uniform actually can only generate 2^23 floats on [1.0, 2.0)
    # and subtract 1.0.
    random_tensor = random_ops.random_uniform(
        noise_shape, seed=seed, dtype=x.dtype)
    keep_prob = 1 - rate
    scale = 1 / keep_prob
    # NOTE: if (1.0 + rate) - 1 is equal to rate, then we want to consider that
    # float to be selected, hence we use a >= comparison.
    keep_mask = random_tensor >= rate
    ret = x * scale * math_ops.cast(keep_mask, x.dtype)
    <...>
    return ret

So x is the dropout layer's input and rate is the rate of dropped-out neurons. Based on this rate a mask is generated: with probability rate each value in this mask is 0, otherwise it is 1. When we multiply a neuron's output by zero, its gradient also becomes zero. So we are not actually sampling a subnetwork, we are just zeroing out some of its neurons. Hope that will help :)
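As a quick check (the exact pattern of zeros depends on the random seed), you can call tf.nn.dropout directly and see that kept values are scaled by 1 / (1 - rate) while dropped values become 0:

import tensorflow as tf

tf.random.set_seed(0)
x = tf.ones([2, 5])               # two samples, five units
y = tf.nn.dropout(x, rate=0.5)
print(y.numpy())
# Roughly half of the entries are 0.0 and the rest are 2.0 (scaled by
# 1 / (1 - rate)); each element gets its own mask, so every sample in
# the batch ends up with a different pattern of dropped units.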

Upvotes: 0
