david

Reputation: 954

How to connect a convolution layer with an LSTM layer in seq2seq tasks?

The seq2seq task is to recognize sentences from video data (also known as visual-only speech recognition, or lip-reading).

The model consists of convolutional layers and an LSTM layer. However, the output of the convolutional layers has the shape [batch_size, height, width, channel_size], while the input of the LSTM layer must have the shape [batch_size, n_steps, dimension].

The workflow is like:

I don't know whether simply reshaping the axes [height', width', channel_size'] into a single [dimension] axis is appropriate for this visual-only speech recognition task.
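What I have in mind is roughly the following (the shapes below are just placeholders for illustration, not my real ones):

import tensorflow as tf

# Illustrative conv output for batch_size=4 videos of n_steps=20 frames,
# after merging the batch and time axes: [4 * 20, height', width', channel_size']
conv_out = tf.zeros([4 * 20, 8, 8, 16])
# Collapse [height', width', channel_size'] into a single feature axis
# and restore the time axis: [batch_size, n_steps, dimension]
lstm_input = tf.reshape(conv_out, [4, 20, 8 * 8 * 16])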

Any tips?

Upvotes: 0

Views: 660

Answers (1)

gorjan

Reputation: 5565

The RNN expects its input to be sequential. Therefore, the input has the shape [time, feature_size], or [batch_size, time, feature_size] if you are processing a batch.
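For instance, with made-up numbers (2 sequences, 7 time steps and 320 features per step), the LSTM consumes a tensor like this:

import numpy as np
import tensorflow as tf

# 2 sequences, each with 7 time steps of 320 features
sequence_batch = np.random.rand(2, 7, 320).astype(np.float32)
# The LSTM produces one 9-dimensional output per time step
outputs, _ = tf.nn.dynamic_rnn(
    tf.nn.rnn_cell.LSTMCell(9), sequence_batch, dtype=tf.float32)
print(outputs.shape)  # (2, 7, 9)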

In your case, the input has the shape [batch_size, number_of_frames, height, width, num_channels]. You then use a convolutional layer to learn spatial dependencies between the pixels of each video frame. For each video frame, the convolutional layer therefore gives you a tensor of shape [activation_map_width, activation_map_height, number_of_filters]. Then, because you want to learn a context-dependent representation of the frames, you can safely reshape everything learned for each frame into a 1D feature vector.

Finally, what you will feed to the RNN is: [b_size, num_frames, am_width * am_height * num_filters].

As for the implementation, if we assume that you have 2 videos, each with 7 frames, where each frame has a width and height of 10 and 3 channels, this is what you would do:

import numpy as np
import tensorflow as tf

# Batch of 2 videos with 7 frames of size [10, 10, 3]
video = np.random.rand(2, 7, 10, 10, 3).astype(np.float32)
# Merging the batch and time axes so each frame can be convolved: [2, 7, 10, 10, 3] -> [14, 10, 10, 3]
video_flat = tf.reshape(video, [14, 10, 10, 3])
# Convolving each frame
video_convolved = tf.layers.conv2d(video_flat, 5, [3,3])
# Reshaping the frames back into the corresponding batches
video_batch = tf.reshape(video_convolved, [2, 7, video_convolved.shape[1], video_convolved.shape[2], 5])
# Flattening everything learned for each frame into a single 1D feature vector
video_flat_frame = tf.reshape(video_batch, [2, 7, video_batch.shape[2] * video_batch.shape[3] * 5])
# Passing the information for each frame through an RNN
outputs, _ = tf.nn.dynamic_rnn(tf.nn.rnn_cell.LSTMCell(9), video_flat_frame, dtype=tf.float32)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Output where we have a context-dependent representation for each video frame
    print(sess.run(outputs).shape)

Please note that I have hardcoded some of the variables in the code for simplicity.
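If you prefer tf.keras, the same idea can be sketched with TimeDistributed, which applies the convolution and the flattening to every frame for you (this is just a sketch, reusing the same hardcoded sizes as above):

import numpy as np
import tensorflow as tf

# Batch of 2 videos with 7 frames of size [10, 10, 3]
video = np.random.rand(2, 7, 10, 10, 3).astype(np.float32)

model = tf.keras.Sequential([
    # Convolve every frame: [2, 7, 10, 10, 3] -> [2, 7, 8, 8, 5]
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Conv2D(5, (3, 3)), input_shape=(7, 10, 10, 3)),
    # Flatten each frame's activation map: [2, 7, 8, 8, 5] -> [2, 7, 320]
    tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten()),
    # Context-dependent representation for each frame: [2, 7, 320] -> [2, 7, 9]
    tf.keras.layers.LSTM(9, return_sequences=True),
])

print(model.predict(video).shape)  # (2, 7, 9)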

I hope that this helps you!

Upvotes: 2
