madsthaks

Reputation: 2181

How do 'forget gates' know not to remove essential information from the Cell State in an LSTM?

First off, I apologize if this isn't appropriate for Stack Overflow. This is more of a theory question than a code question.

This isn't completely clear to me. Say you have a massive passage that you want your LSTM to learn from: how does it make sure it doesn't remove details from the first paragraph?

Upvotes: 1

Views: 275

Answers (2)

danche

Reputation: 1815

In the BPTT algorithm, when a word does not play an important role in determining the final output, its gradient will be small and the corresponding weights will shrink as training goes on. This is automatic; the LSTM mechanism determines it.

Regarding your concern, you may be misunderstanding the LSTM. The LSTM can mitigate the vanishing gradient problem because it converts repeated multiplication into repeated addition: the cell state is updated as c_t = f_t * c_{t-1} + i_t * g_t, so each new state is an additive function of the previous state and the gradient is largely preserved as it flows backward. You can refer to An Empirical Exploration of Recurrent Network Architectures for details of this gradient-accumulation view. In addition, the attention mechanism is now widely applied and may be more appropriate for your need; see Neural Machine Translation by Jointly Learning to Align and Translate.
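To make the additive update concrete, here is a minimal sketch (not a full LSTM; the gate values are hard-coded rather than learned) showing that when the forget gate saturates near 1, information written into the cell state persists across many time steps instead of being squashed by repeated multiplication:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cell_state_step(c_prev, f_gate, i_gate, g_candidate):
    # Core LSTM cell-state update: an additive blend of the old state
    # (scaled by the forget gate) and new candidate content. The old
    # state enters additively, so its backward gradient is modulated by
    # f_gate rather than by a long chain of weight-matrix products.
    return f_gate * c_prev + i_gate * g_candidate

# Toy illustration with fixed (hypothetical) gate activations:
c = np.array([1.0])       # information written at the first time step
for t in range(50):
    f = sigmoid(5.0)      # forget gate saturated near 1
    i = sigmoid(-5.0)     # input gate near 0: write almost nothing new
    g = np.tanh(0.0)      # zero candidate content
    c = cell_state_step(c, f, i, g)
# After 50 steps c is still well above zero, because each step only
# multiplied it by a learned gate close to 1.
```

If the network instead learned a forget gate near 0 at some step, the same update would erase the stored value there; which of the two happens is exactly what training adjusts.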

Upvotes: 1

Thomas Wagenaar

Reputation: 6759

I believe this paper will be of help. It explains the backpropagation algorithm.

Also note that LSTMs that process passages typically use multiple LSTM blocks arranged both sequentially and in parallel. Additionally, neural networks are black boxes: we don't know exactly how they work internally, and they decide for themselves which details are important.

Upvotes: 0
