mickey

Reputation: 495

Meaning of tf.keras.layers.LSTM parameters

I am having trouble understanding some of the parameters of LSTM layers in the tf.keras.layers API.

I am investigating using CuDNNLSTM layers instead of LSTM layers (to speed up training), but before I commit to CuDNN layers, I would like to have a full understanding of the parameters that I lose by using a CuDNNLSTM instead of an LSTM layer. I have read the docs, but they seem to assume some prior knowledge of LSTMs that I do not have.

I have listed the parameters that CuDNNLSTM does not have (but LSTM has), interspersed with my questions about them.

I've read a lot about LSTMs and am at the point where I've decided to start training things; otherwise I won't absorb much more theoretical knowledge. I've tried a lot of modeling changes, too, but the network I'm training is really simple, so nothing seems to affect the results.

Upvotes: 3

Views: 1220

Answers (1)

thushv89

Reputation: 11333

activation vs recurrent_activation

If you look at the LSTM equations, recurrent_activation (defaults to sigmoid) refers to the activation used for the gates (i.e. input/forget/output), and activation (defaults to tanh) refers to the activation used elsewhere (i.e. the candidate cell state and the cell output h).

I can explain the need for two intuitively. For a gate, a range between 0 and 1 makes sense because a gate can be on, off, or somewhere in between, but not negative (thus sigmoid). The cell output, however, is more expressive and saturates less when it ranges between -1 and 1 (thus tanh). It might also help with the vanishing gradient problem, but I'm not entirely sure about that.
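As a minimal sketch (assuming TF 2.x tf.keras, with made-up shapes), this is where the two arguments are passed and what each one ends up acting on:

    import tensorflow as tf

    # recurrent_activation is applied to the input/forget/output gates,
    # activation to the candidate cell state and the cell output.
    lstm = tf.keras.layers.LSTM(
        units=64,
        activation="tanh",               # candidate cell state / cell output
        recurrent_activation="sigmoid",  # input, forget and output gates
        return_sequences=True,
    )

    x = tf.random.normal((8, 20, 32))  # (batch, timesteps, features)
    y = lstm(x)                        # (batch, timesteps, units) = (8, 20, 64)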

use_bias

If use_bias is True, there will be a +b term in the equations (e.g. i_t = sigma(x_t Ui + h_t-1 Wi + bi)). If not, there will be no bias (e.g. i_t = sigma(x_t Ui + h_t-1 Wi)). Personally, I always use a bias.
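A quick way to see the effect (a sketch with made-up sizes): with use_bias=False the layer builds only the kernel (the U matrices) and recurrent_kernel (the W matrices) weights, so the +b term simply disappears.

    import tensorflow as tf

    with_bias = tf.keras.layers.LSTM(16, use_bias=True)
    no_bias = tf.keras.layers.LSTM(16, use_bias=False)

    x = tf.random.normal((4, 10, 8))
    with_bias(x)  # builds kernel, recurrent_kernel and bias
    no_bias(x)    # builds kernel and recurrent_kernel only

    print([w.name for w in with_bias.weights])
    print([w.name for w in no_bias.weights])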

dropout vs recurrent_dropout

The need for both dropout and recurrent_dropout is that applying dropout along the time dimension can be quite disastrous, as you are interfering with the memory of the model, whereas applying dropout to the input data is pretty much what we do day to day with feed-forward models. So (there is a short sketch after this list),

  • dropout: Applies a dropout mask on the input data (x)
  • recurrent_dropout: Applies a dropout mask on the previous state data (h_t-1)
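Here is a minimal sketch (made-up sizes) of how the two arguments are passed; note that both masks are only active when the layer is called with training=True:

    import tensorflow as tf

    lstm = tf.keras.layers.LSTM(
        32,
        dropout=0.2,            # mask applied to the inputs x_t
        recurrent_dropout=0.1,  # mask applied to the previous state h_t-1
        return_sequences=True,
    )

    x = tf.random.normal((4, 10, 8))
    y_train = lstm(x, training=True)   # dropout masks active
    y_infer = lstm(x, training=False)  # dropout disabled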

implementation

The implementation argument gives different ways to compute the same thing. The need for different options likely comes from their different memory and speed trade-offs.

  • implementation=1
    • Here, the computations are done as if you had written the following equations, i.e. in four separate steps.
      • i_t = sigma(x_t Ui + h_t-1 Wi + bi)
      • f_t = sigma(x_t Uf + h_t-1 Wf + bf)
      • o_t = sigma(x_t Uo + h_t-1 Wo + bo)
      • tilde{c}_t = tanh(x_t Uc + h_t-1 Wc + bc)
  • implementation=anything else
    • You do the above computations in one go:
      • z = x_t concat(Ui, Uf, Uo, Uc)
      • z += h_t-1 concat(Wi, Wf, Wo, Wc)
      • z += concat(bi, bf, bo, bc)
      • apply activations

So the second implementation is much more efficient, as there are only two matrix multiplications taking place.
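To make the equivalence concrete, here is a toy NumPy sketch (sizes and weights made up) that computes the pre-activations both ways and checks that they match:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_units = 8, 16
    x_t = rng.normal(size=(1, n_in))        # input at time t
    h_prev = rng.normal(size=(1, n_units))  # previous state h_t-1

    # Separate weights per gate (i, f, o, c), as in implementation=1.
    U = {g: rng.normal(size=(n_in, n_units)) for g in "ifoc"}
    W = {g: rng.normal(size=(n_units, n_units)) for g in "ifoc"}
    b = {g: rng.normal(size=(n_units,)) for g in "ifoc"}

    # implementation=1: four separate (smaller) matrix multiplications.
    pre_1 = {g: x_t @ U[g] + h_prev @ W[g] + b[g] for g in "ifoc"}

    # implementation=2 style: concatenate the weights and do two big multiplications.
    U_all = np.concatenate([U[g] for g in "ifoc"], axis=1)
    W_all = np.concatenate([W[g] for g in "ifoc"], axis=1)
    b_all = np.concatenate([b[g] for g in "ifoc"])
    z = x_t @ U_all + h_prev @ W_all + b_all

    # Split z back into the four pre-activations; only the activations remain.
    pre_2 = dict(zip("ifoc", np.split(z, 4, axis=1)))
    for g in "ifoc":
        assert np.allclose(pre_1[g], pre_2[g])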

unroll

If true, the RNN will be unrolled over the time dimension and the computations will be done without a loop (which is more memory intensive). If false, the computation is done with a for loop, which takes longer but is less memory intensive.
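For completeness, a small sketch (made-up shapes); with unroll=True the input needs a fixed, known number of timesteps, since the loop is baked into the graph:

    import tensorflow as tf

    unrolled = tf.keras.layers.LSTM(32, unroll=True)  # loop unrolled over the 10 steps
    looped = tf.keras.layers.LSTM(32, unroll=False)   # symbolic loop

    x = tf.random.normal((4, 10, 8))
    print(unrolled(x).shape, looped(x).shape)  # both (4, 32)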

The source code I referred to can be found here. Hope this clarifies it.

Upvotes: 4
