Reputation: 954
The seq2seq task is to recognize sentences from video data (also known as visual-only speech recognition, or lip-reading).
The model consists of convolutional layers and an LSTM layer. However, the output of the convolutional layers has the shape [batch_size, height, width, channel_size], while the input of the LSTM layer must have the shape [batch_size, n_steps, dimension].
The workflow is like this: I reshape the input to [batch_size*n_steps, height, width, channel_size] and feed it into the conv layers, which output [batch_size*n_steps, height', width', channel_size']. I then reshape that back to [batch_size, n_steps, height', width', channel_size'], but how can I feed it into the LSTM layer, which requires data in the shape [batch_size, n_steps, dimension]? I don't know if just reshaping the axes [height', width', channel_size'] into the single axis [dimension] is appropriate for this visual-only speech recognition task.
Tips: tf.keras
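To make the shapes concrete, this is roughly what I am doing so far, with toy sizes (the filter count and frame size below are just placeholders I made up):
import numpy as np
import tensorflow as tf

batch_size, n_steps, height, width, channels = 2, 7, 10, 10, 3  # placeholder sizes
video = np.random.rand(batch_size, n_steps, height, width, channels).astype(np.float32)

# Merge batch and time so the conv layer sees ordinary images
x = tf.reshape(video, [batch_size * n_steps, height, width, channels])
x = tf.keras.layers.Conv2D(32, (3, 3))(x)  # [batch_size*n_steps, height', width', channel_size']
# Split batch and time apart again
x = tf.reshape(x, [batch_size, n_steps, x.shape[1], x.shape[2], 32])
# ??? -> the LSTM needs [batch_size, n_steps, dimension].
# Is it OK to just flatten [height', width', channel_size'] into one axis here?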
Upvotes: 0
Views: 660
Reputation: 5565
The RNN expects the input to be sequential. Therefore, the input has the shape [time, feature_size], or [batch_size, time, feature_size] if you are processing a batch.
In your case, the input has the shape [batch_size, number_of_frames, height, width, num_channels]. You then use a convolutional layer to learn spatial dependencies between the pixels in each video frame, so for each video frame the convolutional layer gives you a tensor of shape [activation_map_width, activation_map_height, number_of_filters]. Then, because you want to learn a context-dependent representation of the frames, you are safe to reshape everything learned for each frame into a single 1D vector.
Finally, what you provide to the RNN is: [b_size, num_frames, am_width * am_height * num_filters].
As for the implementation, if we assume that you have 2 videos, each with 7 frames, where each frame has a width and height of 10 and 3 channels, this is what you should do:
import numpy as np
import tensorflow as tf  # TensorFlow 1.x APIs are used below

# Batch of 2 videos with 7 frames of size [10, 10, 3]
video = np.random.rand(2, 7, 10, 10, 3).astype(np.float32)
# Flattening all the frames into one big batch of images
video_flat = tf.reshape(video, [14, 10, 10, 3])
# Convolving each frame
video_convolved = tf.layers.conv2d(video_flat, 5, [3, 3])
# Reshaping the frames back into the corresponding batches
video_batch = tf.reshape(video_convolved, [2, 7, video_convolved.shape[1], video_convolved.shape[2], 5])
# Combining everything learned for each frame into a 1D vector
video_flat_frame = tf.reshape(video_batch, [2, 7, video_batch.shape[2] * video_batch.shape[3] * 5])
# Passing the information for each frame through an RNN
outputs, _ = tf.nn.dynamic_rnn(tf.nn.rnn_cell.LSTMCell(9), video_flat_frame, dtype=tf.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Output where we have a context-dependent representation for each video frame
    print(sess.run(outputs).shape)  # (2, 7, 9)
Please note that I have hardcoded some of the variables in the code for simplicity.
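Also note that tf.layers.conv2d, tf.nn.dynamic_rnn and tf.Session are TensorFlow 1.x APIs. Since you mention tf.keras, here is a rough sketch of the same flatten-then-LSTM idea using Keras layers (the layer sizes are again arbitrary); TimeDistributed applies the wrapped layer to every frame, so you do not have to merge and split the batch and time axes yourself:
import tensorflow as tf

frames = tf.keras.Input(shape=(7, 10, 10, 3))  # [n_steps, height, width, channels]
# Convolve each frame independently: [batch, 7, 8, 8, 5]
x = tf.keras.layers.TimeDistributed(tf.keras.layers.Conv2D(5, (3, 3)))(frames)
# Flatten each frame's activation map into a 1D feature vector: [batch, 7, 8*8*5]
x = tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten())(x)
# Context-dependent representation for each frame: [batch, 7, 9]
outputs = tf.keras.layers.LSTM(9, return_sequences=True)(x)
model = tf.keras.Model(frames, outputs)
model.summary()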
I hope that this helps you!
Upvotes: 2