Reputation: 954
The seq2seq task is to recognize sentences from video data (also known as visual-only speech recognition, or lip-reading).
The model consists of convolutional layers and an LSTM layer. However, the output of the convolutional layers has the shape [batch_size, height, width, channel_size], while the input of the LSTM layer must have the shape [batch_size, n_steps, dimension].
The workflow is like this: I reshape the input to [batch_size*n_steps, height, width, channel_size] and feed it into the conv layers, which output [batch_size*n_steps, height', width', channel_size']. I then reshape that back to [batch_size, n_steps, height', width', channel_size'], but how can I feed it into the LSTM layer, which requires data in the shape [batch_size, n_steps, dimension]? I don't know if just reshaping the axes [height', width', channel_size'] into the single axis [dimension] is appropriate for this visual-only speech recognition task.
Tips: tf.keras
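To make the shapes concrete, this is roughly what I am doing so far, with toy sizes (the filter count and frame size below are just placeholders I made up):
import numpy as np
import tensorflow as tf

batch_size, n_steps, height, width, channels = 2, 7, 10, 10, 3  # placeholder sizes
video = np.random.rand(batch_size, n_steps, height, width, channels).astype(np.float32)

# Merge batch and time so the conv layer sees ordinary images
x = tf.reshape(video, [batch_size * n_steps, height, width, channels])
x = tf.keras.layers.Conv2D(32, (3, 3))(x)  # [batch_size*n_steps, height', width', channel_size']
# Split batch and time apart again
x = tf.reshape(x, [batch_size, n_steps, x.shape[1], x.shape[2], 32])
# ??? -> the LSTM needs [batch_size, n_steps, dimension].
# Is it OK to just flatten [height', width', channel_size'] into one axis here?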
Upvotes: 0
Views: 660
Reputation: 5565
The RNN expects the input to be sequential. Therefore, the input has the shape [time, feature_size], or [batch_size, time, feature_size] if you are processing a batch.
In your case, the input has the shape [batch_size, number_of_frames, height, width, num_channels]. You then use a convolutional layer to learn spatial dependencies between the pixels in each video frame, so for each video frame the convolutional layer gives you a tensor of shape [activation_map_width, activation_map_height, number_of_filters]. Then, because you want to learn a context-dependent representation of the frames, you are safe to reshape everything learned for each frame into a single 1D vector.
Finally, what you provide to the RNN is: [b_size, num_frames, am_width * am_height * num_filters].
As for the implementation, if we assume that you have 2 videos, each with 7 frames, where each frame has a width and height of 10 and 3 channels, this is what you should do:
import numpy as np
import tensorflow as tf  # TensorFlow 1.x APIs are used below

# Batch of 2 videos with 7 frames of size [10, 10, 3]
video = np.random.rand(2, 7, 10, 10, 3).astype(np.float32)
# Flattening all the frames into one big batch of images
video_flat = tf.reshape(video, [14, 10, 10, 3])
# Convolving each frame
video_convolved = tf.layers.conv2d(video_flat, 5, [3, 3])
# Reshaping the frames back into the corresponding batches
video_batch = tf.reshape(video_convolved, [2, 7, video_convolved.shape[1], video_convolved.shape[2], 5])
# Combining everything learned for each frame into a 1D vector
video_flat_frame = tf.reshape(video_batch, [2, 7, video_batch.shape[2] * video_batch.shape[3] * 5])
# Passing the information for each frame through an RNN
outputs, _ = tf.nn.dynamic_rnn(tf.nn.rnn_cell.LSTMCell(9), video_flat_frame, dtype=tf.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Output where we have a context-dependent representation for each video frame
    print(sess.run(outputs).shape)  # (2, 7, 9)
Please note that I have hardcoded some of the variables in the code for simplicity.
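Also note that tf.layers.conv2d, tf.nn.dynamic_rnn and tf.Session are TensorFlow 1.x APIs. Since you mention tf.keras, here is a rough sketch of the same flatten-then-LSTM idea using Keras layers (the layer sizes are again arbitrary); TimeDistributed applies the wrapped layer to every frame, so you do not have to merge and split the batch and time axes yourself:
import tensorflow as tf

frames = tf.keras.Input(shape=(7, 10, 10, 3))  # [n_steps, height, width, channels]
# Convolve each frame independently: [batch, 7, 8, 8, 5]
x = tf.keras.layers.TimeDistributed(tf.keras.layers.Conv2D(5, (3, 3)))(frames)
# Flatten each frame's activation map into a 1D feature vector: [batch, 7, 8*8*5]
x = tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten())(x)
# Context-dependent representation for each frame: [batch, 7, 9]
outputs = tf.keras.layers.LSTM(9, return_sequences=True)(x)
model = tf.keras.Model(frames, outputs)
model.summary()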
I hope that this helps you!
Upvotes: 2