Reputation: 763
I am doing a sequence classification task using nn.TransformerEncoder(), whose pipeline is similar to nn.LSTM().
I have tried several temporal feature fusion methods (sketched below):
Selecting the final output as the representation of the whole sequence.
Using an affine transformation to fuse the per-frame features.
Classifying the sequence frame by frame, and then taking the max values as the category of the whole sequence.
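Roughly, this is what I mean by these three methods (a minimal sketch with made-up shapes and layer names, not my actual code):

import torch
import torch.nn as nn

T, N, D, C = 20, 8, 64, 4              # time steps, batch size, feature dim, classes
enc_out = torch.randn(T, N, D)         # stand-in for the encoder output (T*N*D)
classifier = nn.Linear(D, C)           # sequence-level (or frame-level) classifier
affine = nn.Linear(T * D, D)           # only used by strategy 2

# 1) Last time step as the whole-sequence representation
logits_last = classifier(enc_out[-1])                 # N*C

# 2) Affine transformation over the concatenated frames
flat = enc_out.permute(1, 0, 2).reshape(N, T * D)     # N*(T*D)
logits_affine = classifier(affine(flat))              # N*C

# 3) Frame-by-frame classification, then max over time
frame_logits = classifier(enc_out)                    # T*N*C
logits_max = frame_logits.max(dim=0).values           # N*C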
But all three methods give terrible accuracy, only 25% on a 4-category classification task, while with nn.LSTM and the last hidden state I can easily reach 83%. I have tried plenty of hyperparameters for nn.TransformerEncoder(), but without any improvement in accuracy.
I have no idea how to adjust this model now. Could you give me some practical advice? Thanks.
For the LSTM, the forward() is:
def forward(self, x_in, x_lengths, apply_softmax=False):
    # Embed
    x_in = self.embeddings(x_in)
    # Feed into the LSTM
    out, (h_n, c_n) = self.LSTM(x_in)  # shape of out: T*N*D
    # Gather the last relevant hidden state
    out = out[-1, :, :]  # N*D
    # FC layers
    z = self.dropout(out)
    z = self.fc1(z)
    z = self.dropout(z)
    y_pred = self.fc2(z)
    if apply_softmax:
        y_pred = F.softmax(y_pred, dim=1)
    return y_pred
For the Transformer:
def forward(self, x_in, x_lengths, apply_softmax=False):
    # Embed
    x_in = self.embeddings(x_in)
    # Feed into the Transformer encoder
    out = self.transformer(x_in)  # shape of out: T*N*D
    # Gather the output at the last time step
    out = out[-1, :, :]  # N*D
    # FC layers
    z = self.dropout(out)
    z = self.fc1(z)
    z = self.dropout(z)
    y_pred = self.fc2(z)
    if apply_softmax:
        y_pred = F.softmax(y_pred, dim=1)
    return y_pred
Upvotes: 4
Views: 6931
Reputation: 1
I am not sure that selecting the final output as the representation of the whole sequence is correct for Transformers, as these models do not work the same way as recurrent networks. The last time step does not represent a complete embedding of the sequence, so by using just the last time step I think you are discarding a lot of information.
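As a rough illustration (a sketch only, not tested on your setup), pooling over all time steps keeps information from the whole sequence instead of only the last position; here enc_out stands in for the TransformerEncoder output of shape T*N*D:

import torch

T, N, D = 20, 8, 64                    # illustrative sizes
enc_out = torch.randn(T, N, D)         # stand-in for the TransformerEncoder output

# Mean-pool over the time dimension so every position contributes,
# instead of keeping only enc_out[-1].
seq_repr = enc_out.mean(dim=0)         # N*D, feed this to the FC layers

With padded batches you would also want to mask out the padded positions before averaging.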
Upvotes: 0
Reputation: 37771
The accuracy you mentioned indicates that something is wrong. Since you are comparing LSTM with TransformerEncoder, I want to point to some crucial differences.
Positional embeddings: This is very important since the Transformer has no concept of recurrence and so does not capture order information on its own. Make sure you add positional information to the input embeddings (see the sketch after this list).
Model architecture: d_model, n_head, and num_encoder_layers are important. Go with the default sizes used in Vaswani et al., 2017 (d_model=512, n_head=8, num_encoder_layers=6).
Optimization: In many scenarios it has been found that the Transformer needs to be trained with a smaller learning rate, a larger batch size, and warm-up scheduling (a sketch is given after this list).
Last but not least, as a sanity check, make sure the model's parameters are actually updating. You can also monitor the training accuracy to make sure it keeps increasing as training proceeds.
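For the positional-embedding point, here is a minimal sketch of the classic sinusoidal encoding (essentially the PositionalEncoding module from the PyTorch tutorials; the hyperparameters are illustrative), applied to the embeddings before the encoder:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds sinusoidal position information to embeddings of shape T*N*D."""
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)                    # max_len x 1
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):  # x: T*N*D
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

# In your model's forward, roughly:
#   x_in = self.embeddings(x_in) * math.sqrt(d_model)
#   x_in = self.pos_encoder(x_in)
#   out = self.transformer(x_in)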
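And for the optimization point, a rough sketch of warm-up scheduling with Adam (the base learning rate and warmup_steps are placeholders to tune; the tiny model is only there to make the snippet runnable):

import torch
import torch.nn as nn

model = nn.Linear(64, 4)               # stand-in for your actual classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98), eps=1e-9)

warmup_steps = 4000                    # placeholder, tune for your dataset size

def lr_lambda(step):
    # Linear warm-up followed by inverse-square-root decay, as in Vaswani et al., 2017.
    step = max(step, 1)
    return (warmup_steps ** 0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop:
#   loss.backward()
#   optimizer.step()
#   scheduler.step()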
Although it is difficult to say exactly what is wrong in your code, I hope the above points help!
Upvotes: 5