Arijit


Image Captioning Example input size of Decoder LSTM Pytorch

I'm new to PyTorch, and I have a doubt about the Image Captioning example code. In the DecoderRNN class, the LSTM is defined as:

self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)

In the forward function:

embeddings = self.embed(captions)
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)

We first embed the captions and then concatenate the embeddings with the context feature from the EncoderCNN. But doesn't the concatenation increase the size beyond the embedding size? How can we then forward that to the LSTM, since the input size of the LSTM is already defined as embed_size?

Am I missing something here? Thanks in advance.

Upvotes: 1


Answers (1)

Wasi Ahmad


You can analyze the shapes of all the input and output tensors, and then it will become easier to understand what changes you need to make.

Let's say: captions = B x S where S = sentence (caption) length.

embeddings = self.embed(captions)

Now, embeddings = B x S x E where E = embed_size.

embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)

Here, embeddings = B x (S + 1) x E.
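For example, here is a minimal shape check of that code path (a sketch with made-up sizes B, S, E and vocab_size, not taken from the tutorial):

import torch
import torch.nn as nn

B, S, E = 4, 12, 256          # example batch size, caption length, embed size (assumed values)
vocab_size = 1000             # assumed vocabulary size

embed = nn.Embedding(vocab_size, E)
captions = torch.randint(0, vocab_size, (B, S))   # B x S
features = torch.randn(B, E)                      # B x E, encoder output projected to embed_size

embeddings = embed(captions)                                    # B x S x E
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)  # B x (S + 1) x E
print(embeddings.shape)                                         # torch.Size([4, 13, 256])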

My understanding is that this is being done wrong here. I guess you should concatenate features along axis=2, because you probably want to concatenate the image features with the word embeddings for each word in the caption. To do that, first expand (or repeat) the image features along the sequence dimension so they become B x S x F, where F = img_feat_size, and then concatenate:

features = features.unsqueeze(1).expand(-1, embeddings.size(1), -1)  # B x S x F
embeddings = torch.cat((features, embeddings), 2)

This results in embeddings = B x S x (E + F), where E + F = embed_size + img_feat_size.

Then you need to revise your LSTM definition as follows.

self.lstm = nn.LSTM(embed_size+img_feat_size, hidden_size, num_layers, batch_first=True)

In my experience, people usually concatenate image features with word features and pass them to the LSTM layer.
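Putting this together, a minimal sketch of a decoder along these lines might look as follows (img_feat_size, the expand call, and the final linear layer are my own assumptions, not the tutorial's code):

import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, img_feat_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        # the LSTM now sees word embedding + image feature at every time step
        self.lstm = nn.LSTM(embed_size + img_feat_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        embeddings = self.embed(captions)                        # B x S x E
        # repeat the image feature for every word in the caption: B x F -> B x S x F
        features = features.unsqueeze(1).expand(-1, embeddings.size(1), -1)
        inputs = torch.cat((features, embeddings), 2)            # B x S x (E + F)
        hiddens, _ = self.lstm(inputs)                           # B x S x hidden_size
        return self.linear(hiddens)                              # B x S x vocab_size

Usage would then be logits = decoder(features, captions), with features of shape B x img_feat_size and captions of shape B x S.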

Upvotes: 2
