CubeHead

Reputation: 153

Keras image captioning model not compiling because of concatenate layer when mask_zero=True in a previous layer

I am new to Keras and I am trying to implement a model for an image captioning project.

I am trying to reproduce the model from the image captioning pre-inject architecture (the picture is taken from this paper: Where to put the image in an image captioning generator), but with a minor difference: generating a word at each time step instead of only generating a single word at the end. In this architecture, the inputs to the LSTM at the first time step are the embedded CNN features. The LSTM should support variable input length, and to achieve this I padded all the sequences with zeros so that all of them have maxlen time steps.
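The padding itself looks roughly like this (a minimal sketch using keras.preprocessing.sequence.pad_sequences; captions_as_ids is just a placeholder name for my integer-encoded captions, and the padding direction is not important as long as 0 marks the padded steps):

from keras.preprocessing.sequence import pad_sequences

# captions_as_ids: a list of lists of word indices, one list per caption;
# index 0 is assumed to be reserved for padding
padded_captions = pad_sequences(captions_as_ids,
                                maxlen=maxlen,
                                padding='post',
                                value=0)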

The code for the model I have right now is the following:

from keras.layers import (Input, Dense, Embedding, LSTM, BatchNormalization,
                          RepeatVector, TimeDistributed, concatenate)
from keras.models import Model
from keras.optimizers import Adam


def get_model(model_name, batch_size, maxlen, voc_size, embed_size,
              cnn_feats_size, dropout_rate):

    # create input layer for the cnn features
    cnn_feats_input = Input(shape=(cnn_feats_size,))

    # normalize CNN features 
    normalized_cnn_feats = BatchNormalization(axis=-1)(cnn_feats_input)

    # embed CNN features to have same dimension with word embeddings
    embedded_cnn_feats = Dense(embed_size)(normalized_cnn_feats)

    # add time dimension so that this layer output shape is (None, 1, embed_size)
    final_cnn_feats = RepeatVector(1)(embedded_cnn_feats)

    # create input layer for the captions (each caption has max maxlen words)
    caption_input = Input(shape=(maxlen,))

    # embed the captions
    embedded_caption = Embedding(input_dim=voc_size,
                                 output_dim=embed_size,
                                 input_length=maxlen)(caption_input)

    # concatenate CNN features and the captions.
    # Output shape should be (None, maxlen + 1, embed_size)
    img_caption_concat = concatenate([final_cnn_feats, embedded_caption], axis=1)

    # now feed the concatenation into a LSTM layer (many-to-many)
    lstm_layer = LSTM(units=embed_size,
                      input_shape=(maxlen + 1, embed_size),   # one additional time step for the image features
                      return_sequences=True,
                      dropout=dropout_rate)(img_caption_concat)

    # create a fully connected layer to make the predictions
    pred_layer = TimeDistributed(Dense(units=voc_size))(lstm_layer)

    # build the model with CNN features and captions as input and 
    # predictions output
    model = Model(inputs=[cnn_feats_input, caption_input], 
                  outputs=pred_layer)

    optimizer = Adam(lr=0.0001, 
                     beta_1=0.9, 
                     beta_2=0.999, 
                     epsilon=1e-8)

    model.compile(loss='categorical_crossentropy', optimizer=optimizer)
    model.summary()

    return model

The model (as it is above) compiles without any errors (see: model summary) and I managed to train it using my data. However, it doesn't take into account the fact that my sequences are zero-padded and the results won't be accurate because of this. When I try to change the Embedding layer in order to support masking (also making sure that I use voc_size + 1 instead of voc_size, as it's mentioned in the documentation) like this:

embedded_caption = Embedding(input_dim=voc_size + 1,
                             output_dim=embed_size,
                             input_length=maxlen, mask_zero=True)(caption_input)

I get the following error:

Traceback (most recent call last):
  File "/export/home/.../py3_env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1567, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 200 and 1. Shapes are [200] and [1]. for 'concatenate_1/concat_1' (op: 'ConcatV2') with input shapes: [?,1,200], [?,25,1], [] and with computed input tensors: input[2] = <1>

I don't know why it says the shape of the second array is [?, 25, 1], as I am printing its shape before the concatenation and it's [?, 25, 200] (as it should be). I don't understand why there'd be an issue with a model that compiles and works fine without that parameter, but I assume there's something I am missing.
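The shape check right before the concatenation is roughly this (a sketch using the Keras backend; the concrete numbers assume embed_size=200 and maxlen=25, which is what the traceback shows):

from keras import backend as K

# static shapes of the two tensors that go into concatenate
print(K.int_shape(final_cnn_feats))   # (None, 1, 200)
print(K.int_shape(embedded_caption))  # (None, 25, 200)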

I have also been thinking about using a Masking layer instead of mask_zero=True, but it should be before the Embedding and the documentation says that the Embedding layer should be the first layer in a model (after the input).

Is there anything I could change to fix this, or is there a workaround?

Upvotes: 2

Views: 308

Answers (1)

Y. Luo

Reputation: 5732

The non-equal shape error refers to the masks rather than the tensors/inputs. Since concatenate supports masking, it needs to handle mask propagation. Your final_cnn_feats doesn't have a mask (it is None), while your embedded_caption has a mask of shape (?, 25). You can verify this by doing:

print(embedded_caption._keras_history[0].compute_mask(caption_input))

Since final_cnn_feats has no mask, concatenate gives it an all-non-zero mask for proper mask propagation. While this is correct, that generated mask has the same shape as final_cnn_feats, i.e. (?, 1, 200) rather than (?, 1); in other words, it masks every feature at every time step instead of just each time step. This is where the non-equal shape error comes from ((?, 1, 200) vs (?, 25)).

To fix it, you need to give final_cnn_feats a correct, matching mask. I'm not familiar with your project, but one option is to apply a Masking layer to final_cnn_feats, since that layer is designed to mask time steps.

final_cnn_feats = Masking()(RepeatVector(1)(embedded_cnn_feats))

This is correct only when not all 200 features in final_cnn_feats are zero, i.e. there is always at least one non-zero value in final_cnn_feats. Under that condition, the Masking layer will give a (?, 1) mask and will not mask the single time step in final_cnn_feats.
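Putting this together with the mask_zero=True embedding from the question, the relevant part of get_model would look roughly like this (a sketch; it assumes Masking is imported from keras.layers and the rest of the function stays unchanged):

# CNN branch: same as before, but wrapped in Masking so it carries a (?, 1) mask
final_cnn_feats = Masking()(RepeatVector(1)(embedded_cnn_feats))

# caption branch: masking enabled, vocabulary enlarged by 1 for the padding index
embedded_caption = Embedding(input_dim=voc_size + 1,
                             output_dim=embed_size,
                             input_length=maxlen,
                             mask_zero=True)(caption_input)

# both branches now carry a time-step mask, so concatenate can propagate
# a combined (?, maxlen + 1) mask to the LSTM
img_caption_concat = concatenate([final_cnn_feats, embedded_caption], axis=1)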

Upvotes: 2
