Reputation: 1223
I am following this explanation of LSTMs
In one of the illustrated examples of the gates, they show the forget gate taking in the old cell state value (which is equal to the previous hidden state?) as well as the new input.
My question is two-fold:
1) If the forget gate is supposed to regulate memory of the previous cell state, why is there a need to take in new input? Shouldn’t that just be handled in the input gate exclusively?
2) If the input gate decides what new information is added to the cell state, why do we also feed in the previous cell state in the input gate? Shouldn’t that regulation have already happened in the forget gate?
Overall it seems like there are some redundant processes going on here.
Upvotes: 2
Views: 909
Reputation: 40909
Here are the LSTM equations:
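For reference, this is the standard LSTM formulation, numbered so that the line references in this answer match. The weight names W, U, and b are assumed labels, σ is the logistic sigmoid, ⊙ is element-wise multiplication, and x_t, h_{t-1}, c_{t-1} correspond to x[t], h[t-1], c[t-1] in the text:

$$
\begin{aligned}
(1)\quad & i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
(2)\quad & f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
(3)\quad & o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
(4)\quad & \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
(5)\quad & c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
(6)\quad & h_t = o_t \odot \tanh(c_t)
\end{aligned}
$$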
When you look at these equations, you need to mentally separate out how the gates are computed (lines 1 to 3) and how they are applied (lines 5 and 6). They are computed as a function of the hidden state `h`, but they are applied to the memory cell `c`.
> they show the forget gate taking in the old cell state value (which is equal to the previous hidden state?) as well as the new input.
Let's look specifically at the `forget` gate computed in line 2. Its computation takes as input the current input `x[t]` and the last hidden state `h[t-1]`. (Note that the assertion in your comment is incorrect: the hidden state is different from the memory cell.)
In fact, all of the `input`, `forget`, and `output` gates in lines 1 to 3 are computed uniformly as a function that takes `x[t]` and `h[t-1]`. Broadly speaking, the values of these gates are based on what the current input is and what the state was previously.
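To make the "computed uniformly" point concrete, here is a minimal pure-Python sketch. The helper name `gate`, the toy sizes, and all weight values are made up for illustration; only the shape of the computation follows lines 1 to 3.

```python
import math

def sigmoid(z):
    """Logistic function, squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def gate(W, U, b, x, h_prev):
    """sigmoid(W x + U h_prev + b): one value in (0, 1) per memory unit.

    Note the signature: the memory cell c[t-1] is not an argument.
    """
    out = []
    for W_row, U_row, b_k in zip(W, U, b):
        z = b_k
        z += sum(w * x_j for w, x_j in zip(W_row, x))
        z += sum(u * h_j for u, h_j in zip(U_row, h_prev))
        out.append(sigmoid(z))
    return out

# Toy sizes: 2 input features, 2 memory units. All numbers are hypothetical.
x_t = [1.0, -1.0]
h_prev = [0.5, 0.0]

W_i, U_i, b_i = [[0.1, 0.2], [0.3, 0.4]], [[0.0, 0.1], [0.2, 0.0]], [0.0, 0.0]
W_f, U_f, b_f = [[0.5, -0.5], [0.2, 0.2]], [[0.1, 0.0], [0.0, 0.1]], [1.0, 1.0]
W_o, U_o, b_o = [[-0.3, 0.1], [0.4, 0.0]], [[0.2, 0.2], [0.1, 0.1]], [0.0, 0.0]

# Same function, same inputs, different learned weights per gate.
i_t = gate(W_i, U_i, b_i, x_t, h_prev)
f_t = gate(W_f, U_f, b_f, x_t, h_prev)
o_t = gate(W_o, U_o, b_o, x_t, h_prev)
```

The three gates differ only in their weights, which is why each of them "sees" both the new input and the previous hidden state.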
To directly answer your questions:
> 1) If the forget gate is supposed to regulate memory of the previous cell state, why is there a need to take in new input? Shouldn’t that just be handled in the input gate exclusively?
Don't confuse how the gate is computed with how it is applied. Look at how the forget gate `f` is used in line 5 to do the regulation that you mentioned. The forget gate is applied only to the previous memory cell `c[t-1]`. As you probably know, a gate is simply a vector of floating-point values between 0 and 1, and it is applied as an element-wise multiplication. Here, the `f` gate is multiplied with `c[t-1]`, so that some fraction of each element of `c[t-1]` is kept. In the same line 5, the input gate `i` does the same thing to the new candidate memory cell `c-tilde[t]`. The basic idea of line 5 is that the new memory cell `c[t]` mixes together some of the old memory cell and some of the new candidate memory cell.
Line 5 is the most important one among the LSTM equations. You can find a similar line in the GRU equations.
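As a small worked example of line 5, here is a sketch with hand-picked numbers. The function name `update_cell` is hypothetical, and the gate values are deliberately extreme (a real sigmoid gate only approaches 0 and 1) so the effect is easy to read off.

```python
def update_cell(f, i, c_prev, c_tilde):
    """Line 5: c[t] = f * c[t-1] + i * c-tilde[t], element by element."""
    return [
        f_k * c_k + i_k * n_k
        for f_k, i_k, c_k, n_k in zip(f, i, c_prev, c_tilde)
    ]

f_t = [1.0, 0.0, 0.5]          # keep all, forget all, keep half
i_t = [0.0, 1.0, 0.5]          # admit nothing, admit all, admit half
c_prev = [2.0, -3.0, 4.0]      # old memory cell c[t-1]
c_tilde = [0.75, -0.5, 0.25]   # new candidate memory cell c-tilde[t]

c_t = update_cell(f_t, i_t, c_prev, c_tilde)
print(c_t)  # [2.0, -0.5, 2.125]
```

The first unit keeps its old value untouched, the second is overwritten by the candidate, and the third is a blend of both.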
> 2) If the input gate decides what new information is added to the cell state, why do we also feed in the previous cell state in the input gate? Shouldn’t that regulation have already happened in the forget gate?
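Again, you need to separate how the gates are computed from how they are applied. The input gate does indeed regulate what new information is added to the cell state, and that happens in line 5, as I wrote above. Like the other gates, it is computed from `x[t]` and `h[t-1]`; the previous cell state `c[t-1]` is not one of its inputs.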
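Putting computation and application together, a single LSTM step can be sketched as follows. This is an illustrative pure-Python version with made-up weights, not any library's implementation; `lstm_step`, `affine`, and the `params` layout are hypothetical names. Running it with two different values of `c[t-1]` shows that the gates themselves do not change, only the result of applying them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, x, h_prev):
    """Pre-activation W x + U h_prev + b, one value per memory unit."""
    return [
        b_k
        + sum(w * x_j for w, x_j in zip(W_row, x))
        + sum(u * h_j for u, h_j in zip(U_row, h_prev))
        for W_row, U_row, b_k in zip(W, U, b)
    ]

def lstm_step(params, x, h_prev, c_prev):
    # Lines 1 to 4: computed from x and h_prev only; c_prev is never read.
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]
    # Line 5: the gates are applied to the old and candidate memory cells.
    c = [f_k * c_k + i_k * n_k
         for f_k, i_k, c_k, n_k in zip(f, i, c_prev, c_tilde)]
    # Line 6: the output gate is applied to the squashed new memory cell.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return {"i": i, "f": f, "o": o, "c": c, "h": h}

# Hypothetical weights: 1 input feature, 2 memory units. Each entry is (W, U, b).
params = {
    "i": ([[0.5], [-0.5]], [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0]),
    "f": ([[0.3], [0.3]], [[0.0, 0.2], [0.2, 0.0]], [1.0, 1.0]),
    "o": ([[-0.2], [0.4]], [[0.1, 0.1], [0.1, 0.1]], [0.0, 0.0]),
    "c": ([[1.0], [-1.0]], [[0.5, 0.0], [0.0, 0.5]], [0.0, 0.0]),
}
x_t = [1.0]
h_prev = [0.2, -0.1]

a = lstm_step(params, x_t, h_prev, c_prev=[0.0, 0.0])
b = lstm_step(params, x_t, h_prev, c_prev=[5.0, -5.0])

print(a["i"] == b["i"], a["f"] == b["f"], a["o"] == b["o"])  # True True True
print(a["c"] == b["c"])  # False
```

The memory cell still influences future gates, but only indirectly: `h[t]` in line 6 is computed from `c[t]`, and that `h[t]` is what the gates read at the next time step.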
Upvotes: 4