Reputation: 495
I am having trouble understanding some of the parameters of LSTM
layers in the tf.keras.layers
API.
I am investigating using CuDNNLSTM
layers instead of LSTM
layers (to speed up training), but before I commit to CuDNN
layers, I would like to have a full understanding of the parameters that I lose by using a CuDNNLSTM
instead of an LSTM
layer. I have read the docs, but they seem to assume some prior knowledge of LSTM
s that I do not have.
I have listed the parameters that CuDNNLSTM does not have (but LSTM has), interspersed with my questions about them, respectively.
activation, recurrent_activation
What is the difference between activation and recurrent_activation? I am assuming it has something to do with the activation for a cell vs. the activation for the full LSTM layer, but am unsure.

use_bias
If use_bias is True, where is this bias applied?

dropout, recurrent_dropout
What is the difference between dropout and recurrent_dropout? If recurrent_dropout is dropout between the LSTM cells, that does not make sense to me, because you would be ignoring the previous output, which I thought would defeat the purpose of having an RNN. (Also, is there any difference between using tf.keras.models.Sequential([Input(...), LSTM(...), Dropout(0.5)]) or tf.keras.models.Sequential([Input(...), Dropout(0.5), LSTM(...)]) instead of tf.keras.models.Sequential([Input(...), LSTM(..., dropout=0.5)])?)

implementation
I understand why this is not in the CuDNN layers, since it would probably make it harder to parallelize. However, in LSTMs, does this impact the result (i.e. with the same seed, will implementation=1 converge to the same or different result as implementation=2)?

unroll

I've read a lot about LSTMs, and am at a point where I've decided to start training things, otherwise I won't absorb much more hypothetical knowledge. I've tried a lot of things in modeling, too, but the network I'm training is really simple, so nothing seems to impact the results.
Upvotes: 3
Views: 1220
Reputation: 11333
activation vs recurrent_activation
If you look at the LSTM equations, recurrent_activation (defaults to sigmoid) refers to the activation used for the gates (i.e. input/forget/output), and activation (defaults to tanh) refers to the activation used for everything else (e.g. the candidate cell state and the cell output h).
I can explain the need for two intuitively. For a gate, a range between 0 and 1 is intuitive, because a gate can be fully open, fully closed, or somewhere in the middle, but not negative (thus sigmoid). The cell output, however, is more expressive and saturates less when it ranges between -1 and 1 (thus tanh). It might also help with the vanishing gradient problem, but I'm not entirely sure about that.
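To make the mapping concrete, here is a minimal sketch (layer size, input shape, and data are made up for illustration); the two arguments below just spell out the layer's defaults:

import tensorflow as tf

# recurrent_activation -> input/forget/output gates (defaults to sigmoid)
# activation           -> candidate cell state and output h (defaults to tanh)
lstm = tf.keras.layers.LSTM(
    32,
    activation="tanh",
    recurrent_activation="sigmoid",
    return_sequences=True,
)
x = tf.random.normal((4, 10, 8))  # (batch, timesteps, features)
y = lstm(x)
print(y.shape)  # (4, 10, 32)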
use_bias
If use_bias is True, there will be a +b term in the equations (e.g. i_t = sigma(x_t Ui + h_t-1 Wi + bi)). If not, there will be no bias (e.g. i_t = sigma(x_t Ui + h_t-1 Wi)). Personally, I always use a bias.
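You can see where the bias lives by inspecting the layer's weights (a quick sketch with arbitrary sizes); the bias is a single vector concatenated over the four gates:

import tensorflow as tf

lstm = tf.keras.layers.LSTM(16, use_bias=True)
_ = lstm(tf.random.normal((2, 5, 8)))  # call once so the weights get built

# Prints kernel (input weights U*), recurrent_kernel (recurrent weights W*) and bias,
# each concatenated over the i/f/c/o gates. With use_bias=False the bias is simply absent.
for w in lstm.weights:
    print(w.name, w.shape)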
dropout vs recurrent_dropout
The need for both dropout and recurrent_dropout is that applying dropout on the time dimension can be quite disastrous, as you are influencing the memory of the model, whereas applying dropout on the input data is pretty much what we do day to day with feed-forward models. So,
dropout: applies a dropout mask on the input data (x_t)
recurrent_dropout: applies a dropout mask on the previous state data (h_t-1)
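As for the Dropout layer vs. the dropout argument from your question: a standalone Dropout layer only masks the tensor passed between layers, while the LSTM arguments apply their masks inside the cell at every timestep. A rough sketch (shapes and rates are made up):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10, 8)),    # (timesteps, features)
    tf.keras.layers.Dropout(0.5),     # masks the input features fed to the LSTM
    tf.keras.layers.LSTM(
        32,
        dropout=0.2,                  # mask on the inputs x_t inside the cell
        recurrent_dropout=0.2,        # mask on the previous state h_t-1 inside the cell
    ),
    tf.keras.layers.Dense(1),
])
model.summary()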
implementation
The implementation argument gives different ways to compute the same thing; the need for the different versions is probably the different memory requirements.
implementation=1
i_t = sigma(x_t Ui + h_t-1 Wi + bi)
f_t = sigma(x_t Uf + h_t-1 Wf + bf)
o_t = sigma(x_t Uo + h_t-1 Wo + bo)
tilde{c}_t = tanh(x_t Uc + h_t-1 Wc + bc)
implementation=anything else
z = x_t concat(Ui, Uf, Uo, Uc)
z += h_t-1 concat(Wi, Wf, Wo, Wc)
z += concat(bi, bf, bo, bc)
So the second implementation is more efficient, as there are only two (larger) matrix multiplications taking place instead of eight smaller ones.
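Regarding whether the result changes: both implementations compute the same math, so with identical weights the outputs should match up to floating-point error. A quick sketch to convince yourself (toy shapes, TF 2.x eager execution assumed):

import numpy as np
import tensorflow as tf

x = tf.random.normal((2, 5, 8))

lstm1 = tf.keras.layers.LSTM(16, implementation=1)
lstm2 = tf.keras.layers.LSTM(16, implementation=2)

y1 = lstm1(x)                           # builds lstm1's weights
_ = lstm2(x)                            # builds lstm2's weights
lstm2.set_weights(lstm1.get_weights())  # weight layout is the same for both implementations
y2 = lstm2(x)

print(np.allclose(y1.numpy(), y2.numpy(), atol=1e-5))  # expect True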
unroll
If true, it will unroll the RNN on the time dimension and do the computations without a loop (which will be memory intensive). If false, this will be done with a for loop, which will take longer but be less memory intensive.
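One practical detail: unrolling needs a fixed, known number of timesteps, so the input shape has to specify it. A small sketch (sizes are made up):

import tensorflow as tf

# Unrolled: static computation over a fixed number of timesteps
# (faster for short sequences, more memory).
unrolled = tf.keras.Sequential([
    tf.keras.Input(shape=(20, 8)),   # 20 timesteps must be known at build time
    tf.keras.layers.LSTM(32, unroll=True),
])

# Looped (default): a symbolic loop over time, so variable-length sequences are fine.
looped = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 8)),
    tf.keras.layers.LSTM(32, unroll=False),
])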
The source code I referred to can be found here. Hope this clarifies it.
Upvotes: 4