Ghrua

Reputation: 7686

What's the difference between the state_keep_prob and output_keep_prob parameters of tf.contrib.rnn.DropoutWrapper?

According to the API of tf.contrib.rnn.DropoutWrapper:

the descriptions of these two parameters are almost the same, right?

When I leave output_keep_prob at its default and set state_keep_prob=0.2, the loss stays around 11.3 after 400 mini-batches of training. But when I set output_keep_prob=0.2 and leave state_keep_prob at its default, the loss quickly drops to around 6.0 after only 20 mini-batches! It took me 4 days to find this bug. Can anyone explain the difference between these two parameters? Thanks a lot!
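Roughly, the two configurations I compared look like this (a minimal TF 1.x sketch; the cell size is just a placeholder, not my actual setting):

import tensorflow as tf

# Configuration A: dropout on the recurrent state only
# (output_keep_prob keeps its default of 1.0) -> loss stuck around 11.3
cell_a = tf.contrib.rnn.DropoutWrapper(
    tf.contrib.rnn.LSTMCell(num_units=128),
    state_keep_prob=0.2)

# Configuration B: dropout on the cell output only
# (state_keep_prob keeps its default of 1.0) -> loss drops to around 6.0
cell_b = tf.contrib.rnn.DropoutWrapper(
    tf.contrib.rnn.LSTMCell(num_units=128),
    output_keep_prob=0.2)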

Hyperparameters:

Here is the dataset.

Upvotes: 1

Views: 1577

Answers (2)

Bhaskar Arun

Reputation: 31

Both are correctly described in the API as output keep probabilities; which one you should use depends on whether you decide to use outputs or states to compute your logits.

I am providing a code snippet for you to play around with and explore the use cases:

import tensorflow as tf
import numpy as np

tf.reset_default_graph()

# Create input data: batch of 2 sequences, 20 time steps, 8 features
X = np.random.randn(2, 20, 8)

# The first example is of length 6; zero out the padding
X[0, 6:] = 0
X_lengths = [6, 20]

# Two stacked LSTM layers with 3 and 7 units
rnn_layers = [tf.nn.rnn_cell.LSTMCell(size, state_is_tuple=True)
              for size in [3, 7]]

# Wrap each layer with dropout on both the state and the output
rnn_layers = [tf.nn.rnn_cell.DropoutWrapper(lstm_cell,
                                            state_keep_prob=0.8,
                                            output_keep_prob=0.8)
              for lstm_cell in rnn_layers]

multi_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(rnn_layers)

outputs, states = tf.nn.dynamic_rnn(cell=multi_rnn_cell,
                                    dtype=tf.float64,
                                    sequence_length=X_lengths,
                                    inputs=X)

result = tf.contrib.learn.run_n({"outputs": outputs, "states": states},
                                n=1,
                                feed_dict=None)

# The top layer has 7 units, so outputs has shape (batch, time, 7)
assert result[0]["outputs"].shape == (2, 20, 7)

print(result[0]["states"][0].h)
print(result[0]["states"][-1].h)
print(result[0]["outputs"][0][5])
print(result[0]["outputs"][-1][-1])
print(result[0]["outputs"].shape)
print(result[0]["outputs"][0].shape)
print(result[0]["outputs"][1].shape)

# Last valid output of each sequence vs. final hidden state of the top layer
assert (result[0]["outputs"][-1][-1] == result[0]["states"][-1].h[-1]).all()
assert (result[0]["outputs"][0][5] == result[0]["states"][-1].h[0]).all()

result[0]["outputs"][0][6:] will be arrays of all 0s.

Both assertions will fail when state_keep_prob and output_keep_prob are < 1, because the emitted output and the returned state each get their own dropout mask. But when they are set to the same value, say 0.8 as in this example, you can see that apart from the dropout mask they produce the same final state.

If you have a variable sequence_length, you should definitely use states to compute your logits, and in that case use state_keep_prob < 1 while training.

If you plan on using outputs, you should use output_keep_prob while training. (Outputs are appropriate when sequence_length is constant, or when you need an output at every time step; with a variable sequence_length they need further manipulation to extract the final valid output.)

If output_keep_prob and state_keep_prob are both used with different values, then the final state as seen in outputs will differ from the one returned in states, along with different dropout masks.
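As a rough sketch of that choice, continuing from the snippet above (num_classes and the tf.layers.dense call are just illustrative additions, not part of the model above):

num_classes = 10

# (a) Variable sequence_length: compute logits from the final state of the
#     top layer. states[-1].h already holds the last *valid* hidden state of
#     each sequence, so pair this with state_keep_prob < 1 during training.
logits_from_state = tf.layers.dense(states[-1].h, num_classes)

# (b) Constant sequence_length (or per-time-step predictions): compute logits
#     from outputs (shape [batch, time, units]) and pair this with
#     output_keep_prob < 1. With variable lengths, outputs[:, -1, :] would be
#     zeros for the shorter sequences.
logits_from_outputs = tf.layers.dense(outputs[:, -1, :], num_classes)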

Upvotes: 1

GeertH

Reputation: 1768

  • state_keep_prob is the dropout added to the RNN's hidden states. The dropout added to the state of time step i will influence the calculation of states i+1, i+2, ... . As you have discovered, this propagation effect is often detrimental to the learning process.
  • output_keep_prob is the dropout added to the RNN's outputs; this dropout has no effect on the calculation of the subsequent states (see the sketch below).
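To illustrate the difference, here is a simplified NumPy sketch (not the actual DropoutWrapper implementation; the toy cell is only meant to show where each mask acts):

import numpy as np

def drop(x, keep_prob):
    # Inverted dropout: zero units with probability (1 - keep_prob), rescale the rest
    mask = (np.random.rand(*x.shape) < keep_prob) / keep_prob
    return x * mask

def toy_rnn(inputs, state_keep_prob=1.0, output_keep_prob=1.0):
    # Toy cell: new_state = tanh(x_t + state). Real LSTMs differ, but the
    # placement of the two dropout masks is the point.
    state = np.zeros_like(inputs[0])
    outputs = []
    for x_t in inputs:
        new_state = np.tanh(x_t + state)
        # output dropout: only changes what this time step emits
        outputs.append(drop(new_state, output_keep_prob))
        # state dropout: the dropped state is fed back, so this mask also
        # influences the computation at steps t+1, t+2, ...
        state = drop(new_state, state_keep_prob)
    return outputs, state

x = [np.random.randn(5) for _ in range(4)]  # 4 time steps, 5 units
outs, final_state = toy_rnn(x, state_keep_prob=0.2)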

Upvotes: 4
