Reputation: 1
I trained a sequence-to-sequence Bi-LSTM model with attention (2-layer encoder, 2-layer decoder) for decoding sequences; all input and output sequences have the same length. For comparison, I trained a two-layer Bi-GRU model on the same task with the same data. The seq2seq model performed better than the GRU model and its loss converged faster, but when I test on longer sequences the GRU model generalizes better, while the seq2seq model loses its performance.
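For reference, the two kinds of models I am comparing look roughly like this. This is only a minimal PyTorch sketch, not my exact training code: the dimensions are placeholders and the attention is simplified to dot-product (Luong-style) attention with teacher forcing.

```python
import torch
import torch.nn as nn

# Placeholder dimensions -- not the exact values used in my experiments.
VOCAB_IN, VOCAB_OUT, EMB, HID = 100, 100, 64, 128


class BiGRUTagger(nn.Module):
    """Baseline: 2-layer Bi-GRU mapping each input token to an output token
    (possible because input and output sequences have the same length)."""

    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_IN, EMB)
        self.gru = nn.GRU(EMB, HID, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * HID, VOCAB_OUT)

    def forward(self, x):                      # x: (batch, seq_len)
        h, _ = self.gru(self.emb(x))           # (batch, seq_len, 2*HID)
        return self.out(h)                     # (batch, seq_len, VOCAB_OUT)


class Seq2SeqAttention(nn.Module):
    """2-layer Bi-LSTM encoder + 2-layer LSTM decoder with dot-product
    attention (simplified: decoder state is not initialized from encoder)."""

    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(VOCAB_IN, EMB)
        self.tgt_emb = nn.Embedding(VOCAB_OUT, EMB)
        self.encoder = nn.LSTM(EMB, HID, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(EMB, 2 * HID, num_layers=2, batch_first=True)
        self.out = nn.Linear(4 * HID, VOCAB_OUT)   # [decoder state; context]

    def forward(self, src, tgt):               # teacher forcing for simplicity
        enc, _ = self.encoder(self.src_emb(src))         # (B, S, 2*HID)
        dec, _ = self.decoder(self.tgt_emb(tgt))         # (B, T, 2*HID)
        scores = torch.bmm(dec, enc.transpose(1, 2))     # (B, T, S)
        attn = torch.softmax(scores, dim=-1)
        context = torch.bmm(attn, enc)                   # (B, T, 2*HID)
        return self.out(torch.cat([dec, context], dim=-1))


# quick shape check
src = torch.randint(0, VOCAB_IN, (4, 20))
tgt = torch.randint(0, VOCAB_OUT, (4, 20))
print(BiGRUTagger()(src).shape)             # torch.Size([4, 20, 100])
print(Seq2SeqAttention()(src, tgt).shape)   # torch.Size([4, 20, 100])
```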
Why can this happen? I thought attention models tend to generalize well across a variety of sequence lengths.
My interpretation is that the seq2seq model overfits to the sequence length seen in training, so I suppose the training set should contain different sequence lengths (see the sketch below). The architectural difference between GRU and LSTM could also play a role: since the GRU model is less complex, that may have helped it generalize better.
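If length overfitting is the cause, one fix I am considering is training on batches that mix several sequence lengths. A minimal sketch of that idea, assuming PyTorch and a hypothetical `make_example` generator in place of my real data pipeline:

```python
import random
import torch
from torch.nn.utils.rnn import pad_sequence


def make_example(length):
    """Hypothetical stand-in for my real data generator: returns an
    (input, target) pair of the requested length."""
    x = torch.randint(0, 100, (length,))
    y = torch.flip(x, dims=[0])          # dummy target: reversed input
    return x, y


def sample_batch(batch_size, lengths=(10, 20, 30, 40)):
    """Draw a batch whose examples come from several different lengths,
    padded to the longest sequence in the batch."""
    pairs = [make_example(random.choice(lengths)) for _ in range(batch_size)]
    xs, ys = zip(*pairs)
    x = pad_sequence(xs, batch_first=True, padding_value=0)
    y = pad_sequence(ys, batch_first=True, padding_value=0)
    return x, y


x, y = sample_batch(8)
print(x.shape, y.shape)   # e.g. torch.Size([8, 40]) torch.Size([8, 40])
```

The padded positions would of course need to be masked in the loss and in the attention, which I have left out here.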
Upvotes: 0
Views: 42