madsthaks

Reputation: 2181

How do 'forget gates' know not to remove essential information from the Cell State in an LSTM?

First off, I apologize if this isn't appropriate for Stack Overflow. This is more of a theory question than a code question.

This isn't completely clear to me. Say you have a massive passage that you want your LSTM to learn from: how does it make sure it doesn't remove details from the first paragraph?

Upvotes: 1

Views: 275

Answers (2)

danche

Reputation: 1815

In the BPTT algorithm, when a word does not play an important role in determining the final output, its gradient will be small and the corresponding weights will shrink as training goes on. This is automatic; the LSTM mechanism determines it.

Regarding your concern, you may be misunderstanding the LSTM. The LSTM can mitigate the vanishing gradient problem because it converts repeated multiplication into repeated addition: the cell state is updated as c_t = f_t * c_{t-1} + i_t * g_t, so each new state is an additive function of the previous state and the gradient is largely preserved as it flows backward. You can refer to An Empirical Exploration of Recurrent Network Architectures for details of this gradient-accumulation view. In addition, the attention mechanism is now widely applied and may be more appropriate for your need; see Neural Machine Translation by Jointly Learning to Align and Translate.
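To make the additive update concrete, here is a minimal sketch (not a full LSTM; the gate values are hard-coded rather than learned) showing that when the forget gate saturates near 1, information written into the cell state persists across many time steps instead of being squashed by repeated multiplication:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cell_state_step(c_prev, f_gate, i_gate, g_candidate):
    # Core LSTM cell-state update: an additive blend of the old state
    # (scaled by the forget gate) and new candidate content. The old
    # state enters additively, so its backward gradient is modulated by
    # f_gate rather than by a long chain of weight-matrix products.
    return f_gate * c_prev + i_gate * g_candidate

# Toy illustration with fixed (hypothetical) gate activations:
c = np.array([1.0])       # information written at the first time step
for t in range(50):
    f = sigmoid(5.0)      # forget gate saturated near 1
    i = sigmoid(-5.0)     # input gate near 0: write almost nothing new
    g = np.tanh(0.0)      # zero candidate content
    c = cell_state_step(c, f, i, g)
# After 50 steps c is still well above zero, because each step only
# multiplied it by a learned gate close to 1.
```

If the network instead learned a forget gate near 0 at some step, the same update would erase the stored value there; which of the two happens is exactly what training adjusts.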

Upvotes: 1

Thomas Wagenaar

Reputation: 6759

I believe this paper will be of help. It explains the backpropagation algorithm.

Also note that LSTMs that process passages typically use multiple LSTM blocks arranged both sequentially and in parallel. Additionally, neural networks are black boxes: we don't know exactly how they work internally, and they decide for themselves which details are important.

Upvotes: 0
