Reputation: 365
I do not really understand the seemingly different (or actually the same?) training procedures for training an LSTM encoder-decoder.
On the one hand, the tutorial uses a for loop for training: https://www.tensorflow.org/tutorials/text/nmt_with_attention#training
On the other hand, the first model here https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
just uses a simple
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
Both procedures say they are training via teacher forcing.
But I cannot understand why both ways are the same.
Why can I train an encoder-decoder without a for loop, like normal model training, even though I need the previous decoding step to train the next decoding step?
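For reference, the tutorial's loop boils down to roughly the following (my own simplified sketch, not the tutorial's exact code; the encoder/decoder call signatures and the start-token handling are assumptions):

import tensorflow as tf

def train_step(encoder, decoder, optimizer, loss_fn, source, target):
    # encoder/decoder are hypothetical models standing in for the tutorial's;
    # their call signatures here are assumptions made for illustration.
    loss = 0.0
    with tf.GradientTape() as tape:
        enc_output, enc_state = encoder(source)
        dec_state = enc_state
        dec_input = target[:, 0:1]              # assumed start token
        for t in range(1, target.shape[1]):     # one decoder call per time step
            logits, dec_state = decoder(dec_input, dec_state, enc_output)
            loss += loss_fn(target[:, t], logits)
            # Teacher forcing: the ground-truth token becomes the next input,
            # regardless of what the decoder just predicted.
            dec_input = target[:, t:t + 1]
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss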
Upvotes: 0
Views: 1260
Reputation: 2682
In an LSTM, the output of a time step depends only on the state and the previous time steps. In the second link (the Keras blog), what is happening during training is that the final state is not being used at all... only the per-step outputs. During inference, the state is saved from one iteration to the next.
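Roughly, the inference loop from that post looks like the following (a simplified sketch, not the blog's exact code; decoder_model, the start-token index and the shapes are assumptions):

import numpy as np

def decode_sequence(encoder_model, decoder_model, input_seq, num_tokens, max_len):
    # Sketch of the state-carrying inference loop: the states returned by one
    # decoder call are fed back in on the next call.
    states = encoder_model.predict(input_seq)          # [state_h, state_c]
    target_seq = np.zeros((1, 1, num_tokens))
    target_seq[0, 0, 0] = 1.0                          # assumed start-token index
    decoded = []
    for _ in range(max_len):
        output_tokens, h, c = decoder_model.predict([target_seq] + states)
        token_index = int(np.argmax(output_tokens[0, -1, :]))
        decoded.append(token_index)
        target_seq = np.zeros((1, 1, num_tokens))
        target_seq[0, 0, token_index] = 1.0            # feed back the prediction
        states = [h, c]                                 # carry the LSTM state forward
    return decoded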
The following answer explains the concept of time steps in an LSTM: What exactly is timestep in an LSTM Model?
The unrolled-LSTM picture there (a chain of repeating A blocks) is useful for the sake of discussion.
To reconcile with the LSTM Keras API:
In this image, the output of step N depends only on [x0, ..., xN].
When you have a model, as defined in your link, that depends only on the h values in the picture above, then when one calculates the losses/gradients the math is the same whether you do it in one shot or in a loop.
This would not hold if the final LSTM state were used (the side arrow out of the rightmost A block in the picture).
From the Keras LSTM API documentation:
return_state: Boolean. Whether to return the last state in addition to the output. Default: False.
The relevant comment in the code:
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
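For context, the training-side decoder in that post is wired up roughly like this (a sketch with placeholder sizes, not the exact blog code); note that only the per-step outputs go into the loss:

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

latent_dim = 256            # placeholder size
num_encoder_tokens = 80     # placeholder vocabulary sizes
num_decoder_tokens = 100

encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
# The decoder returns per-step outputs plus its final states; only the
# per-step outputs feed the loss, the states are reused at inference time.
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)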
You can try to look at a sequence of length 2. If you calculate the gradients of the predictions for time steps 0 and 1 in one shot, then, as far as the LSTM is concerned, the gradient for h0 (the output of time step 0) depends only on the corresponding input; the gradient of h1 (the output of time step 1) depends on x0 and x1 and the transformations through the LSTM. If you calculate the gradients time step by time step, you end up with the exact same calculation.
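You can check this numerically on a toy length-2 sequence, for example with a sketch like the following (arbitrary shapes and a squared-error loss, just for illustration):

import tensorflow as tf

tf.random.set_seed(0)
units = 4
lstm = tf.keras.layers.LSTM(units, return_sequences=True)
x = tf.random.normal((1, 2, 3))           # batch of 1, 2 time steps, 3 features
y = tf.random.normal((1, 2, units))       # dummy targets for h0 and h1

# One shot: run the whole sequence through the layer and take the loss
# over both time steps at once.
with tf.GradientTape() as tape:
    h = lstm(x)                            # shape (1, 2, units): h0 and h1
    loss = tf.reduce_sum((h - y) ** 2)
grads_one_shot = tape.gradient(loss, lstm.trainable_variables)

# Step by step: drive the underlying cell in a Python loop, carrying the
# state forward, and accumulate the per-step losses.
with tf.GradientTape() as tape:
    state = [tf.zeros((1, units)), tf.zeros((1, units))]
    loss = 0.0
    for t in range(2):
        h_t, state = lstm.cell(x[:, t], state)
        loss += tf.reduce_sum((h_t - y[:, t]) ** 2)
grads_stepwise = tape.gradient(loss, lstm.trainable_variables)

for g1, g2 in zip(grads_one_shot, grads_stepwise):
    tf.debugging.assert_near(g1, g2, atol=1e-5, rtol=1e-5)  # same gradients either way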
If you look at transformer models, you will see that they use a mask over the sequence to ensure that step N only depends on the steps before it.
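For comparison, such a causal (look-ahead) mask can be built like this (a generic sketch, not tied to any particular transformer implementation):

import tensorflow as tf

def causal_mask(seq_len):
    # Lower-triangular matrix: position N may attend to positions 0..N only.
    return tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

print(causal_mask(4).numpy())
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]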
Upvotes: 1