Reputation: 7686
According to the API docs of tf.contrib.rnn.DropoutWrapper:

output_keep_prob: unit Tensor or float between 0 and 1, output keep probability; if it is constant and 1, no output dropout will be added.

state_keep_prob: unit Tensor or float between 0 and 1, output keep probability; if it is constant and 1, no output dropout will be added. State dropout is performed on the output states of the cell.

The descriptions of these two parameters are almost the same, right?
When I leave output_keep_prob at its default and set state_keep_prob=0.2, the loss stays around 11.3 even after 400 mini-batches of training. When I instead set output_keep_prob=0.2 and leave state_keep_prob at its default, the loss quickly drops to around 6.0 after only 20 mini-batches! It took me 4 days to find this, which feels like magic. Can anyone explain the difference between these two parameters? Thanks a lot!
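To make the two settings concrete, this is roughly how I wrap my cell in each run (a simplified sketch, not my actual model; the cell size is just a placeholder):

import tensorflow as tf

lstm_cell = tf.contrib.rnn.LSTMCell(num_units=128)  # placeholder size, not my real hyperparameter

# Run A: output_keep_prob left at its default (1.0), state dropout only
# -> loss stuck around 11.3 after 400 mini-batches
cell_a = tf.contrib.rnn.DropoutWrapper(lstm_cell, state_keep_prob=0.2)

# Run B: state_keep_prob left at its default (1.0), output dropout only
# -> loss drops to around 6.0 after 20 mini-batches
cell_b = tf.contrib.rnn.DropoutWrapper(lstm_cell, output_keep_prob=0.2)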
Hyperparameters:
Here is the dataset.
Upvotes: 1
Views: 1577
Reputation: 31
Both are correctly described as an output keep probability; which one you should use depends on whether you decide to use outputs or states to compute your logits.
I am providing a code snippet for you to play around with and explore the use cases:
import tensorflow as tf
import numpy as np

tf.reset_default_graph()

# Create input data
X = np.random.randn(2, 20, 8)

# The first example is of length 6
X[0, 6:] = 0
X_lengths = [6, 20]

rnn_layers = [tf.nn.rnn_cell.LSTMCell(size, state_is_tuple=True)
              for size in [3, 7]]
rnn_layers = [tf.nn.rnn_cell.DropoutWrapper(lstm_cell,
                                            state_keep_prob=0.8,
                                            output_keep_prob=0.8)
              for lstm_cell in rnn_layers]

multi_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(rnn_layers)

outputs, states = tf.nn.dynamic_rnn(
    cell=multi_rnn_cell,
    dtype=tf.float64,
    sequence_length=X_lengths,
    inputs=X)

result = tf.contrib.learn.run_n(
    {"outputs": outputs, "states": states},
    n=1,
    feed_dict=None)

assert result[0]["outputs"].shape == (2, 20, 7)

print(result[0]["states"][0].h)
print(result[0]["states"][-1].h)
print(result[0]["outputs"][0][5])
print(result[0]["outputs"][-1][-1])
print(result[0]["outputs"].shape)
print(result[0]["outputs"][0].shape)
print(result[0]["outputs"][1].shape)

assert (result[0]["outputs"][-1][-1] == result[0]["states"][-1].h[-1]).all()
assert (result[0]["outputs"][0][5] == result[0]["states"][-1].h[0]).all()
result[0]["outputs"][0][6:]
will be arrays of all 0s.
Both the assertions will fail in case when state_keep_prob
and output_keep_prob
are <1 but when equated to the same value say 0.8 as in this example you can see apart from the dropout mask they produce the same final state.
If you have a variable sequence_length you should definitely use states to compute your logits, and in that case use state_keep_prob < 1 while training. If you plan to use outputs (appropriate with a constant sequence_length, or when you need the output at every time step; with a variable sequence_length it needs further manipulation to pick out the last valid output), use output_keep_prob < 1 while training.
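For concreteness, here is a minimal sketch of the two ways to compute logits from the snippet above (num_classes and the dense layers are illustrative additions, not part of the snippet):

num_classes = 5  # illustrative

# Option 1: logits from the final state of the top LSTM layer.
# states[-1].h already corresponds to the last *valid* time step of each
# sequence, so this handles variable sequence_length correctly.
logits_from_state = tf.layers.dense(states[-1].h, num_classes)

# Option 2: logits from the outputs tensor. Taking outputs[:, -1, :] is
# only correct when every sequence uses the full length; with variable
# lengths you would have to gather the last valid step yourself.
last_output = outputs[:, -1, :]
logits_from_output = tf.layers.dense(last_output, num_classes)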
If output_keep_prob and state_keep_prob are both used with different dropout values, then the final state returned in outputs and the one returned in states will differ, along with their dropout masks.
Upvotes: 1
Reputation: 1768
state_keep_prob is the dropout applied to the RNN's hidden states. Dropout applied to the state at time step i influences the calculation of states i+1, i+2, .... As you have discovered, this propagation effect is often detrimental to the learning process.

output_keep_prob is the dropout applied to the RNN's outputs; this dropout has no effect on the calculation of the subsequent states.

Upvotes: 4
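To make the distinction above concrete, here is a toy NumPy sketch of where each mask enters the recurrence (purely illustrative; this is not the actual DropoutWrapper implementation, and the simple tanh cell, sizes, and keep probabilities are made up):

import numpy as np

rng = np.random.RandomState(0)
state_keep_prob, output_keep_prob = 0.2, 1.0  # the asker's first configuration
hidden = 4
W, U = rng.randn(hidden, hidden), rng.randn(hidden, hidden)
x = rng.randn(10, hidden)  # 10 time steps of input
state = np.zeros(hidden)

for t in range(10):
    state = np.tanh(x[t] @ U + state @ W)  # the recurrence itself
    output = state                         # a plain RNN's output is its state

    # State dropout: the dropped state feeds step t+1, so the mask's
    # effect propagates through every later time step.
    state = state * (rng.rand(hidden) < state_keep_prob) / state_keep_prob

    # Output dropout: only what consumers of `output` see is dropped;
    # the recurrence itself is untouched.
    output = output * (rng.rand(hidden) < output_keep_prob) / output_keep_prob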