Reputation: 2373
I am using Tensorflow's combination of GRUCell + MultiRNNCell + dynamic_rnn to build a multi-layer RNN that predicts a sequence of elements.
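Roughly, the graph setup looks like this (a minimal sketch, assuming the TensorFlow 1.x graph-mode API; the layer sizes and feature dimension are placeholders):

    import tensorflow as tf

    num_units, num_layers, feature_dim = 128, 2, 10

    # inputs: [batch_size, max_time, feature_dim]; dynamic_rnn unrolls to
    # whatever sequence length is actually fed in at run time
    inputs = tf.placeholder(tf.float32, [None, None, feature_dim])

    cells = [tf.nn.rnn_cell.GRUCell(num_units) for _ in range(num_layers)]
    multi_cell = tf.nn.rnn_cell.MultiRNNCell(cells)

    outputs, final_state = tf.nn.dynamic_rnn(multi_cell, inputs, dtype=tf.float32)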
In the few examples I have seen, like character-level language models, once the Training stage is done, the Generation seems to be done by feeding only ONE 'character' (or whatever element) at a time to get the next prediction, and then getting the following 'character' based on the first prediction, etc.
My question is: since Tensorflow's dynamic_rnn unrolls the RNN graph into an arbitrary number of steps, matching whatever sequence length is fed into it, what is the benefit of feeding only one element at a time once a predicted sequence is gradually being built up? Doesn't it make more sense to collect a gradually longer sequence with each predictive step and re-feed it into the graph? I.e., after generating the first prediction, feed back a sequence of 2 elements, then 3, etc.?
I am currently trying out the prediction stage by initially feeding in a sequence of 15 elements (actual historic data), getting the last element of the prediction, and then replacing one element in the original input with that predicted value, and so on in a loop of N predictive steps.
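In code, the loop looks roughly like this (a sketch rather than my exact code; historic_data and n_steps are placeholders, and predictions is assumed to be a dense projection of the RNN outputs back to element space, e.g. tf.layers.dense(outputs, feature_dim)):

    import numpy as np

    window = list(historic_data[-15:])        # last 15 actual elements

    for _ in range(n_steps):                  # N predictive steps
        feed = np.asarray(window)[None, ...]  # shape [1, 15, feature_dim]
        pred = sess.run(predictions, feed_dict={inputs: feed})
        next_element = pred[0, -1]            # prediction for the step after the window
        window = window[1:] + [next_element]  # slide the window forward by one element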
What is the disadvantage of this approach versus feeding just one element at a time?
Upvotes: 0
Views: 1386
Reputation: 714
I'm not sure your approach is actually doing what you want it to do.
Let's say we have an LSTM network trained to generate the alphabet. Now, in order to have the network generate a sequence, we start with a clean state h0 and feed in the first character, a. The network outputs a new state, h1, and its prediction, b, which we append to our output.

Next, we want the network to predict the next character based on the current output, ab. If we fed the network ab with the state being h1 at this step, its perceived sequence would be aab, because h1 was calculated after the first a, and now we put in another a and a b. Alternatively, we could feed ab and a clean state h0 into the network, which would produce a proper output (based on ab), but we would perform unnecessary calculations for the whole sequence except b, because we have already calculated the state h1, which corresponds to the network having read the sequence a. So in order to get the next prediction and state, we only have to feed in the next character, b.
So to answer your question: feeding the network one character at a time makes sense because the network needs to see each character only once, and feeding the same characters again would just repeat calculations it has already done.
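A minimal sketch of that one-step loop in TensorFlow 1.x (hypothetical names: vocab_size, num_steps, one_hot, sample and sess are placeholders for whatever encoding, sampling and session your model uses; a single GRU layer is used for simplicity, matching the cells in the question; the key point is that final_state from one step is fed back as init_state for the next):

    import numpy as np
    import tensorflow as tf

    num_units = 128
    x = tf.placeholder(tf.float32, [1, 1, vocab_size])       # one character per step
    init_state = tf.placeholder(tf.float32, [1, num_units])  # state carried between steps

    cell = tf.nn.rnn_cell.GRUCell(num_units)
    outputs, final_state = tf.nn.dynamic_rnn(cell, x, initial_state=init_state)
    logits = tf.layers.dense(outputs[:, -1, :], vocab_size)  # scores for the next character

    # ... after training ...
    state = np.zeros([1, num_units], np.float32)  # the "clean state" h0
    char = one_hot('a')                           # first input character
    for _ in range(num_steps):
        scores, state = sess.run([logits, final_state],
                                 feed_dict={x: char[None, None, :], init_state: state})
        char = one_hot(sample(scores))            # next input is only the newest character

Each character passes through the network exactly once; the state carries everything the network has already read.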
Upvotes: 1
Reputation: 380
This is a great question; I asked something very similar here.
The idea there is that instead of sharing weights across time (processing one element at a time, as you describe it), each time step gets its own set of weights.
I believe there are several reasons for training one step at a time, mainly computational complexity and training difficulty. The number of weights you need to train grows linearly with the number of time steps, so you'd need some pretty sporty hardware to train long sequences, and for long sequences you'd also need a very large data set to fit all those weights. But imho, I am still optimistic that for the right problem, with sufficient resources, it would show an improvement.
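For a rough sense of the scaling, compare the parameter counts (a toy calculation with made-up sizes):

    # Shared weights (standard RNN): one input->hidden and one hidden->hidden
    # matrix, reused at every time step.
    n_in, n_hidden, seq_len = 32, 128, 100
    shared = n_in * n_hidden + n_hidden * n_hidden                 # constant in seq_len

    # Separate weights per time step: a fresh pair of matrices for each step.
    per_step = seq_len * (n_in * n_hidden + n_hidden * n_hidden)   # grows linearly

    print(shared, per_step)  # 20480 vs 2048000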
Upvotes: 1