easy_rider

Reputation: 363

concatenation of lstm outputs

I'm trying to build a multitask image captioning model, which consists of two separate encoder-decoder models with LSTMs, each taking inputs from a different dataset. The outputs of the two LSTMs are combined via a Concatenate layer, and the output of the concatenation is then passed to Dense layers. Here is the model code:

from tensorflow.keras.layers import (Input, Dropout, Dense, RepeatVector,
                                     Embedding, LSTM, concatenate)
from tensorflow.keras.models import Model

EMBEDDING_DIM = 256  # matches the 256-unit Dense/Embedding layers in the summary below

def define_model(vocab_size1, max_length1, vocab_size2, max_length2):
    # first encoder-decoder
    inputs1 = Input(shape=(4096,))
    print(inputs1.shape)
    fe1_1 = Dropout(0.5)(inputs1)
    fe2_1 = Dense(EMBEDDING_DIM, activation='relu')(fe1_1)
    fe3_1 = RepeatVector(max_length1)(fe2_1)

    inputs2 = Input(shape=(max_length1,))
    print(inputs2.shape)
    emb2_1 = Embedding(vocab_size1, EMBEDDING_DIM, mask_zero=True)(inputs2)

    merged1 = concatenate([fe3_1, emb2_1], name='concat1')
    lm2_1 = LSTM(500, return_sequences=False)(merged1)

    # second encoder-decoder
    inputs3 = Input(shape=(4096,))
    fe1_2 = Dropout(0.5)(inputs3)
    fe2_2 = Dense(EMBEDDING_DIM, activation='relu')(fe1_2)
    fe3_2 = RepeatVector(max_length2)(fe2_2)

    inputs4 = Input(shape=(max_length2,))
    emb2_2 = Embedding(vocab_size2, EMBEDDING_DIM, mask_zero=True)(inputs4)

    merged2 = concatenate([fe3_2, emb2_2], name='concat2')
    lm2_2 = LSTM(500, return_sequences=False)(merged2)

    # merge the two LSTM outputs and attach one softmax head per task
    merged3 = concatenate([lm2_1, lm2_2], name='concat3')  # the error points here
    outputs = Dense(vocab_size1, activation='softmax')(merged3)
    outputs1 = Dense(vocab_size2, activation='softmax')(merged3)

    # tie it together: [image1, seq1, image2, seq2] -> [word1, word2]
    model = Model(inputs=[inputs1, inputs2, inputs3, inputs4], outputs=[outputs, outputs1])
    model.compile(loss=['categorical_crossentropy', 'categorical_crossentropy'],
                  optimizer='adam', metrics=['accuracy'])
    model.summary()
    # plot_model(model, show_shapes=True, to_file='model.png')
    return model

I can initialize it correctly:

model = define_model(fvocab_size, fmax_length, wvocab_size, wmax_length)

    Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 4096)]       0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 4096)]       0                                            
__________________________________________________________________________________________________
dropout (Dropout)               (None, 4096)         0           input_1[0][0]                    
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 4096)         0           input_3[0][0]                    
__________________________________________________________________________________________________
dense (Dense)                   (None, 256)          1048832     dropout[0][0]                    
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 34)]         0                                            
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 256)          1048832     dropout_1[0][0]                  
__________________________________________________________________________________________________
input_4 (InputLayer)            [(None, 21)]         0                                            
__________________________________________________________________________________________________
repeat_vector (RepeatVector)    (None, 34, 256)      0           dense[0][0]                      
__________________________________________________________________________________________________
embedding (Embedding)           (None, 34, 256)      1940224     input_2[0][0]                    
__________________________________________________________________________________________________
repeat_vector_1 (RepeatVector)  (None, 21, 256)      0           dense_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 21, 256)      1428992     input_4[0][0]                    
__________________________________________________________________________________________________
concat1 (Concatenate)           (None, 34, 512)      0           repeat_vector[0][0]              
                                                                 embedding[0][0]                  
__________________________________________________________________________________________________
concat2 (Concatenate)           (None, 21, 512)      0           repeat_vector_1[0][0]            
                                                                 embedding_1[0][0]                
__________________________________________________________________________________________________
lstm (LSTM)                     (None, 500)          2026000     concat1[0][0]                    
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 500)          2026000     concat2[0][0]                    
__________________________________________________________________________________________________
concat3 (Concatenate)           (None, 1000)         0           lstm[0][0]                       
                                                                 lstm_1[0][0]                     
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 7579)         7586579     concat3[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 5582)         5587582     concat3[0][0]                    
==================================================================================================
Total params: 22,693,041
Trainable params: 22,693,041
Non-trainable params: 0

The input shapes of Concatenate are (None, 500) and (None, 500), and its output is (None, 1000). However, when passing actual data through the generator, I get an error:

InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-15-e52b85d1307b> in <module>()
     12 
     13 model.fit(train_generator, epochs=20,  verbose=1, steps_per_epoch=steps, validation_steps=val_steps,
---> 14     callbacks=[checkpoint], validation_data=val_generator)
     15 
     16 try:

6 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1098                 _r=1):
   1099               callbacks.on_train_batch_begin(step)
-> 1100               tmp_logs = self.train_function(iterator)
   1101               if data_handler.should_sync:
   1102                 context.async_wait()

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
    826     tracing_count = self.experimental_get_tracing_count()
    827     with trace.Trace(self._name) as tm:
--> 828       result = self._call(*args, **kwds)
    829       compiler = "xla" if self._experimental_compile else "nonXla"
    830       new_tracing_count = self.experimental_get_tracing_count()

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
    886         # Lifting succeeded, so variables are initialized and we can run the
    887         # stateless function.
--> 888         return self._stateless_fn(*args, **kwds)
    889     else:
    890       _, _, _, filtered_flat_args = \

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in __call__(self, *args, **kwargs)
   2941        filtered_flat_args) = self._maybe_define_function(args, kwargs)
   2942     return graph_function._call_flat(
-> 2943         filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
   2944 
   2945   @property

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1917       # No tape is watching; skip to running the function.
   1918       return self._build_call_outputs(self._inference_function.call(
-> 1919           ctx, args, cancellation_manager=cancellation_manager))
   1920     forward_backward = self._select_forward_and_backward_functions(
   1921         args,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in call(self, ctx, args, cancellation_manager)
    558               inputs=args,
    559               attrs=attrs,
--> 560               ctx=ctx)
    561         else:
    562           outputs = execute.execute_with_cancellation(

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

InvalidArgumentError:  All dimensions except 1 must match. Input 1 has shape [4 500] and doesn't match input 0 with shape [47 500].
     [[node gradient_tape/model/concat3/ConcatOffset (defined at <ipython-input-15-e52b85d1307b>:14) ]] [Op:__inference_train_function_14543]

Function call stack:
train_function

Code of the generator:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def create_sequences(tokenizer, max_length, desc_list, photo):
  vocab_size = len(tokenizer.word_index) + 1
  X1, X2, y = [], [], []
  # walk through each description for the image
  for desc in desc_list:
    # encode the sequence
    seq = tokenizer.texts_to_sequences([desc])[0]
    # split one sequence into multiple X,y pairs
    for i in range(1, len(seq)):
      # split into input and output pair
      in_seq, out_seq = seq[:i], seq[i]
      # pad input sequence
      in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
      # encode output sequence
      out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
      # store
      X1.append(photo)
      X2.append(in_seq)
      y.append(out_seq)
  return np.array(X1), np.array(X2), np.array(y)


def double_generator(descriptions1, photos1, tokenizer1, max_length1,
                       descriptions2, photos2, tokenizer2, max_length2, n_step=1):
  while True:
    # loop over photo identifiers in the dataset
    keys1 = list(descriptions1.keys())
    keys2 = list(descriptions2.keys())    # len(keys1) == len(keys2)
    for i in range(0, len(keys1), n_step):
      Ximages1, XSeq1, y1 = list(), list(), list()
      Ximages2, XSeq2, y2 = list(), list(), list()
      for j in range(i, min(len(keys1), i+n_step)):
        image_id1 = keys1[j]
        # retrieve the photo feature
        photo1 = photos1[image_id1][0]
        desc_list1 = descriptions1[image_id1]
        # print(desc_list)
        in_img1, in_seq1, out_word1 = create_sequences(tokenizer1, max_length1, desc_list1, photo1)
        # print(in_img, in_seq, out_word)
        for k in range(len(in_img1)):
          Ximages1.append(in_img1[k])
          XSeq1.append(in_seq1[k])
          y1.append(out_word1[k])
        # print('Ximages1', Ximages1)
        # print('Xseq1', XSeq1)
        # print('y1', y1)
      for j in range(i, min(len(keys2), i+n_step)):
        image_id2 = keys2[j]
        # retrieve the photo feature
        photo2 = photos2[image_id2][0]
        desc_list2 = descriptions2[image_id2]
        # print(desc_list)
        in_img2, in_seq2, out_word2 = create_sequences(tokenizer2, max_length2, desc_list2, photo2)
        # print(in_img, in_seq, out_word)
        for k in range(len(in_img2)):
          Ximages2.append(in_img2[k])
          XSeq2.append(in_seq2[k])
          y2.append(out_word2[k])
        # print('Ximages2', Ximages2)
        # print('Xseq2', XSeq2)
        # print('y2', y2)
      yield ([np.array(Ximages1), np.array(XSeq1), np.array(Ximages2), np.array(XSeq2)], [np.array(y1), np.array(y2)])

Everything works fine when there is only one dataset and no LSTM concatenation (i.e., plain single-task image captioning).

The input shapes in the error change every time I call next(generator) and, as I understand it, correlate with the description lengths, even though I use padding.
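
For reference, this is how I check the batch shapes coming out of the generator (a minimal check; the exact shapes vary from batch to batch):

batch_x, batch_y = next(train_generator)
for arr in batch_x:
    print(arr.shape)  # e.g. (47, 4096), (47, 34), (4, 4096), (4, 21)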

The Keras functional API guide contains an example similar to mine, called "Manipulate complex graph topologies" (https://keras.io/guides/functional_api/), which also concatenates LSTM outputs, and I don't see why it doesn't work in my case without any reshaping.

Thanks in advance

Upvotes: 0

Views: 1363

Answers (1)

Akshay Sehgal

Reputation: 19307

TLDR;

You are sending 47 samples for one pair of inputs and 4 samples for the other pair via the generator at the same time. The model compiles because the first dimension (None) accepts a variable batch size, but when the tensors shaped (47, 500) and (4, 500) coming out of the two LSTMs reach the concatenate layer, it cannot join them, since their batch sizes differ. So you get an error while training rather than while compiling.

If you are trying to generate a single sample (one row of data) at a time via your generator, then perhaps you have 2D inputs shaped (47, 4096) and (4, 4096). In that case, you should reshape them to (1, 47, 4096) and (1, 4, 4096). This would change your architecture completely, but it would be in line with what I think you are trying to do.
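
A minimal sketch of that reshape, assuming each 2D array is really one sample:

import numpy as np

sample1 = np.zeros((47, 4096), dtype=np.float32)  # one 2D "sample" from your generator
sample2 = np.zeros((4, 4096), dtype=np.float32)

batch1 = np.expand_dims(sample1, axis=0)  # (1, 47, 4096): a batch holding one 3D sample
batch2 = np.expand_dims(sample2, axis=0)  # (1, 4, 4096)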


Details -

The issue is that you are passing different-sized batches as inputs to the model. The model builds anyway because the first dimension (None) holds the batch size and is left unspecified.
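
You can see this by inspecting any Keras Input tensor:

from tensorflow.keras.layers import Input

x = Input(shape=(4096,))
print(x.shape)  # (None, 4096): the leading None is the batch size,
                # filled in at runtime by whatever the generator yields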

Let's look at what happens in your model step by step, following just two of the inputs (Ximages1 and Ximages2).

For each batch from the generator, you first pass:

Input Layer -

input_1 (InputLayer) [(None, 4096)] #(47, 4096) Ximages1
input_3 (InputLayer) [(None, 4096)] #(4, 4096)  Ximages2

These go into intermediate layers until they reach the individual LSTMs.

LSTM Layers -

lstm (LSTM)   (None, 500) concat1[0][0] #(47, 500)              
lstm_1 (LSTM) (None, 500) concat2[0][0] #(4, 500)

Now the next layer, concatenate, tries to combine the two LSTM outputs into a single tensor:

concat3 (Concatenate) (None, 1000) lstm[0][0]  #(47, 500)                  
                                   lstm_1[0][0] #(4, 500)

From an architecture point of view, it can concatenate (None, 500) with (None, 500) over the last axis to give (None, 1000); the implicit assumption, however, is that both inputs carry the same number of samples in every batch.

In other words, you can't concatenate a (47, 500) tensor with a (4, 500) tensor, because their batch dimensions don't match.
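
You can reproduce the failure in isolation (shapes taken from your error message):

import tensorflow as tf

a = tf.zeros((47, 500))  # batch of 47 samples out of the first LSTM
b = tf.zeros((4, 500))   # batch of 4 samples out of the second LSTM
# concat3 joins over the feature axis (axis 1), so the batch axes must match:
tf.concat([a, b], axis=1)  # InvalidArgumentError: 47 != 4 on the batch axis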

  • You may want to reconsider how you are creating the generator's output batches.
  • If (47, 4096) and (4, 4096) are each supposed to be a single sample, you may want to output them as 3D tensors, (1, 47, 4096) and (1, 4, 4096), instead of 2D ones.
  • This way your input layers would take (None, 47, 4096) and (None, 4, 4096).
  • This will accordingly change every subsequent layer, because you now have to work with an extra dimension (see the sketch below).
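
A hedged sketch of what the reworked input side might look like (layer names are illustrative, and downstream layers would need something like TimeDistributed to cope with the extra axis):

from tensorflow.keras.layers import Input, Dense, TimeDistributed

inputs1 = Input(shape=(47, 4096))  # batch shape (None, 47, 4096)
inputs3 = Input(shape=(4, 4096))   # batch shape (None, 4, 4096)
fe2_1 = TimeDistributed(Dense(256, activation='relu'))(inputs1)  # (None, 47, 256)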

Upvotes: 3
