Reputation: 340
I am trying to build an image captioning model.
# Keras 1.x-style imports (Merge was removed in Keras 2)
from keras.models import Sequential
from keras.layers import Merge, Embedding, LSTM, TimeDistributed, Dense, RepeatVector, Activation

modelV = createVGG16()  # HELPER THAT BUILDS A STANDARD VGG16
modelV.trainable = False
# DISCARD LAST 2 LAYERS SO THE MODULE OUTPUTS IMAGE FEATURES
# INSTEAD OF CLASS PROBABILITIES
modelV.layers.pop()
modelV.layers.pop()
# IN KERAS 1.x, POPPING layers ALONE LEAVES THE MODEL OUTPUT STALE;
# THE OUTPUT HAS TO BE REWIRED TO THE NEW LAST LAYER
modelV.outputs = [modelV.layers[-1].output]
modelV.layers[-1].outbound_nodes = []
print 'LOADED VISION MODULE'

modelL = Sequential()
# CONVERTING THE INPUT PARTIAL CAPTION INDEX VECTOR TO A DENSE VECTOR REPRESENTATION
modelL.add(Embedding(self.vocab_size, 256, input_length=self.max_cap_len))
modelL.add(LSTM(128, return_sequences=True))
# OUTPUT OF THE LANGUAGE MODULE: (samples, max_cap_len, 128)
modelL.add(TimeDistributed(Dense(128)))
print 'LOADED LANGUAGE MODULE'

# REPEATING THE IMAGE VECTOR TO TURN IT INTO A SEQUENCE OF LENGTH max_cap_len
modelV.add(RepeatVector(self.max_cap_len))
print 'LOADED REPEAT MODULE'

model = Sequential()
# CONCATENATING THE IMAGE SEQUENCE AND THE WORD SEQUENCE ALONG THE FEATURE AXIS
model.add(Merge([modelV, modelL], mode='concat', concat_axis=-1))
# ENCODING THE VECTOR SEQUENCE INTO A SINGLE VECTOR
# WHICH WILL BE USED TO COMPUTE THE PROBABILITY DISTRIBUTION
# OF THE NEXT WORD IN THE CAPTION
model.add(LSTM(256, return_sequences=False))
# FINAL OUTPUT SHAPE: (samples, vocab_size)
model.add(Dense(self.vocab_size))
model.add(Activation('softmax'))
if ret_model:
    return model
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
print 'COMBINED MODULES'
return model
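For training, I turn each caption into (image, partial caption) -> next-word pairs and fit the merged model on its two inputs as a list. A rough sketch (training_examples and the padding details are placeholders, not my exact code):

import numpy as np
from keras.preprocessing import sequence
from keras.utils import np_utils

# BUILD (IMAGE, PARTIAL CAPTION) -> NEXT-WORD PAIRS
X_img, X_cap, y = [], [], []
for image, caption in training_examples:    # caption = list of word indices incl. start/end tokens
    for t in range(1, len(caption)):
        X_img.append(image)                 # image tensor for the VGG16 branch
        X_cap.append(caption[:t])           # partial caption up to position t
        y.append(caption[t])                # index of the next word to predict
X_cap = sequence.pad_sequences(X_cap, maxlen=self.max_cap_len)
y = np_utils.to_categorical(y, self.vocab_size)
# THE MERGED MODEL TAKES ITS TWO INPUTS AS A LIST
model.fit([np.array(X_img), X_cap], y, batch_size=32, nb_epoch=50)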
I have tried running this model on all 5 captions of the first 100 images of the Flickr8k test dataset for 50 epochs. All captions are prepended with a start-of-caption token and appended with an end-of-caption token. To generate a caption I give the input image and the start token as the initial word. With each iteration I predict the probability distribution over the vocabulary and obtain the next word. In the next iteration I give the predicted word as the input and generate the probability distribution again.
What happens is that I get the same probability distribution in every iteration.
My questions are:
- Is my model too small to generate captions?
- Is the training data too small?
- Is the number of epochs too small?
- Is my entire approach wrong?
Upvotes: 4
Views: 2236
Reputation: 37761
Before answering your questions, I want to ask: what did you mean by "iteration" in the following statement?
What happens is that I get the same probability distribution in every iteration.
Given an image and the initial word, you should get the next word, which should then be given as input to generate the following word, and this process should go on until you get a special token (e.g., EOC) that represents the end of the caption.
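In code, that loop could look roughly like this (a sketch, not your exact setup; word_to_ix / ix_to_word are placeholder vocabulary mappings, BOC/EOC are the start and end tokens, and image/model come from your code above):

import numpy as np
from keras.preprocessing import sequence

# GREEDY DECODING: FEED THE GROWING PARTIAL CAPTION BACK IN AT EVERY STEP
caption = [word_to_ix['BOC']]              # start-of-caption token
for _ in range(max_cap_len):
    padded = sequence.pad_sequences([caption], maxlen=max_cap_len)
    probs = model.predict([np.array([image]), padded])[0]
    next_ix = int(np.argmax(probs))        # greedily pick the most likely next word
    caption.append(next_ix)
    if ix_to_word[next_ix] == 'EOC':       # stop at the end-of-caption token
        break
print ' '.join(ix_to_word[ix] for ix in caption[1:-1])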
- Is my model too small to generate captions?
I would say no, but this model may be too small to generate good captions.
- Is the training data too small?
Yes, only 100 images is not enough to train an image captioning neural network.
- Is the number of epochs too small?
No, 50 epochs is not too small. You can try tuning other parameters instead, for example the learning rate (see the snippet after these points)!
- Is my entire approach wrong?
No, your approach is not wrong, and you can build on it to generate good captions. You will find good examples on the web; go through them, and I believe you will get ideas from them.
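For example, to try a smaller RMSprop learning rate than the Keras default (lr=0.001), you could compile with an explicit optimizer object:

from keras.optimizers import RMSprop

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=1e-4),  # smaller than the 1e-3 default
              metrics=['accuracy'])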
Upvotes: 3