starlord

Reputation: 21

Data Preprocessing - Input Shape for TimeDistributed CNN (LRCN) & ConvLSTM2D for Video Classification

I'm trying to do binary classification on a labeled dataset of 300+ videos. The goal is to extract features with a ConvNet and feed them into an LSTM for sequence modeling, producing a single binary output after all the frames in a video have been evaluated. I've preprocessed each video to have exactly 200 frames, each a 256 x 256 image, so that it is easier to feed into a DNN, and I've split the dataset into two folders that serve as labels (e.g. dog and cat).

However, after searching Stack Overflow for hours, I'm still unsure how to reshape the dataset of video frames so that the model accounts for the number of frames. I've tried feeding the frames, shaped as e.g. (300, 200, 256, 256, 3), into both a 3D ConvNet and a TimeDistributed 2D ConvNet + LSTM, with no luck. I can do 2D ConvNet classification fairly easily (there the data is a 4D tensor, and I need to add a time-step dimension to make it a 5D tensor), but I'm having trouble handling the temporal aspect.

I've been using Keras ImageDataGenerator and train_datagen.flow_from_directory to read in the images, and I run into shape mismatch errors when I try to feed the result to a TimeDistributed ConvNet. I know that, hypothetically, if I had an X_train dataset I could do X_train = X_train.reshape(...). Any example code would be very much appreciated.

Upvotes: 2

Views: 1251

Answers (1)

GoFindTruth

Reputation: 220

I think you could use ConvLSTM2D in Keras for your purpose. ImageDataGenerator works very well for CNNs on images, but it may not be convenient for a CRNN on videos.

You have already transformed your 300 videos to the same shape (200, 256, 256, 3): each video has 200 frames, and each frame is 256x256 RGB. Next, you need to load them into a numpy array of shape (300, 200, 256, 256, 3). For reading videos into numpy arrays, see this answer.
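
As a minimal sketch of that step, assuming each video is already available as a numpy array of shape (frames, height, width, channels): the helper names stack_videos and frames_to_videos are hypothetical, and the sizes are tiny stand-ins so that it runs quickly. The second helper covers the case from the question, where the frames sit in one flat 4D array ordered video by video.

```python
import numpy as np

def stack_videos(videos):
    """Stack per-video arrays of shape (frames, h, w, c) into one 5D array
    of shape (num_videos, frames, h, w, c)."""
    return np.stack(videos, axis=0)

def frames_to_videos(frames, frames_per_video):
    """Regroup a flat 4D array (num_videos * frames_per_video, h, w, c),
    ordered video by video, into a 5D array."""
    if frames.shape[0] % frames_per_video != 0:
        raise ValueError("frame count is not a multiple of frames_per_video")
    return frames.reshape((-1, frames_per_video) + frames.shape[1:])

# Stand-in data: 5 videos of 4 frames, each 8 x 8 RGB.
# The real data would be 300 videos of 200 frames, each 256 x 256 RGB.
videos = [np.full((4, 8, 8, 3), i, dtype=np.uint8) for i in range(5)]
video_data = stack_videos(videos)
print(video_data.shape)  # (5, 4, 8, 8, 3)

# The same frames as a 2D-CNN pipeline would see them, then regrouped.
flat_frames = video_data.reshape(-1, 8, 8, 3)
regrouped = frames_to_videos(flat_frames, frames_per_video=4)
print(regrouped.shape)  # (5, 4, 8, 8, 3)
```

The reshape only gives the right grouping if the flat array keeps all frames of one video together and in order; shuffled frames would be silently mixed across videos.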

Then you can feed the data into a CRNN. Its first ConvLSTM2D layer should have input_shape = (200, 256, 256, 3); the batch dimension is not part of input_shape.

A sample based on your data (illustrative only, not tested):

from keras.models import Sequential
from keras.layers import ConvLSTM2D, Dense, Flatten

num_of_categories = 2  # number of your video categories (e.g. dog and cat)

model = Sequential()
# input_shape excludes the batch dimension: (frames, height, width, channels)
model.add(ConvLSTM2D(filters = 32, kernel_size = (5, 5), input_shape = (200, 256, 256, 3)))
### model.add(...more layers)
# with the default return_sequences = False, ConvLSTM2D outputs a 4D tensor,
# so flatten it before the Dense classifier
model.add(Flatten())
model.add(Dense(units = num_of_categories,
                kernel_initializer = 'Orthogonal', activation = 'softmax'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

# then train it (video_data and video_labels are assumed to be loaded already)
model.fit(video_data, # shape (300, 200, 256, 256, 3)
          video_labels, # one-hot labels, shape (300, num_of_categories)
          batch_size = 20,
          epochs = 50,
          validation_split = 0.1)
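
One practical caveat: 300 x 200 x 256 x 256 x 3 values come to roughly 11.8 GB as uint8 and four times that as float32, so holding everything in one array may not fit in memory. A possible workaround, sketched here under assumptions, is a generator that loads only one batch of videos at a time. The names video_batches and load_video are hypothetical; load_video(i) stands for whatever function returns video i as a (frames, height, width, channels) array, and the /255 scaling is an assumption about the pixel range.

```python
import numpy as np

def video_batches(load_video, labels, batch_size):
    """Yield (batch_x, batch_y) forever, loading batch_size videos at a time.

    load_video: hypothetical callable, load_video(i) -> (frames, h, w, c) array
    labels:     one-hot array of shape (num_videos, num_categories)
    """
    num_videos = len(labels)
    while True:
        order = np.random.permutation(num_videos)  # reshuffle every epoch
        for start in range(0, num_videos, batch_size):
            idx = order[start:start + batch_size]
            batch_x = np.stack([load_video(i) for i in idx])
            # Assumes uint8 pixels in [0, 255]; scale to [0, 1] floats.
            batch_x = batch_x.astype("float32") / 255.0
            yield batch_x, labels[idx]

# Stand-in loader and labels so the sketch runs without real video files.
def fake_load_video(i):
    return np.full((4, 8, 8, 3), i, dtype=np.uint8)

fake_labels = np.eye(2, dtype="float32")[[0, 1, 0, 1, 1]]  # 5 one-hot labels
batches = video_batches(fake_load_video, fake_labels, batch_size=2)
first_x, first_y = next(batches)
print(first_x.shape, first_y.shape)  # (2, 4, 8, 8, 3) (2, 2)
```

Recent Keras versions accept such a generator directly in model.fit together with steps_per_epoch (older ones use fit_generator). validation_split does not work with generators, so validation data would have to be passed separately.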

I hope this helps a little.

Upvotes: 1
