Mark Mahill

Reputation: 43

How to structure my video dataset based on extracted features for building a CNN-LSTM classification model?

For my project, which deals with emotion recognition, I have a dataset consisting of multiple videos ranging from 0.5 s to 10 s in length. I have an application that goes through each video and creates a .csv file containing the features it has extracted from each frame, i.e., each row represents one frame of the video (so the number of rows varies from video to video) and the columns represent the different features the application has extracted from that frame (the number of columns is fixed). Each .csv filename also contains a code representing the emotion being expressed in the video.
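For reference, this is roughly how I read one of these files at the moment; the filename pattern and emotion codes below are just placeholders for whatever the extraction tool actually writes:

    import re
    from pathlib import Path

    import pandas as pd

    # placeholder emotion codes -- the real ones come from the tool's filename
    # convention, e.g. "clip_HA_01.csv" for a happy clip
    EMOTION_CODES = {'HA': 0, 'SA': 1, 'AN': 2, 'FE': 3, 'SU': 4, 'NE': 5}

    def load_video_csv(path):
        """Return the (n_frames, n_features) array and the emotion label for one video."""
        path = Path(path)
        features = pd.read_csv(path).to_numpy(dtype='float32')
        code = re.search(r'_([A-Z]{2})_', path.name).group(1)   # placeholder pattern
        return features, EMOTION_CODES[code]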

Initially, my plan was to extract each frame from the video and pass the frames as input to the following CNN-LSTM model (CNN for the spatial features, LSTM for the temporal features):

    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import (Input, Conv3D, AveragePooling3D,
                                         Reshape, Dense, CuDNNLSTM)
    # note: in TF 2.x, CuDNNLSTM lives under tf.compat.v1.keras.layers

    model = Sequential()

    model.add(Input(shape=input_shape))

    # frame-wise spatial convolutions (kernel and stride of 1 along the time axis)
    model.add(Conv3D(6, (1, 5, 5), (1, 1, 1), activation='relu', name='conv-1'))
    model.add(AveragePooling3D((1, 2, 2), strides=(1, 2, 2), name='avgpool-1'))

    model.add(Conv3D(16, (1, 5, 5), (1, 1, 1), activation='relu', name='conv-2'))
    model.add(AveragePooling3D((1, 2, 2), strides=(1, 2, 2), name='avgpool-2'))

    model.add(Conv3D(32, (1, 5, 5), (1, 1, 1), activation='relu', name='conv-3'))
    model.add(AveragePooling3D((1, 2, 2), strides=(1, 2, 2), name='avgpool-3'))

    model.add(Conv3D(64, (1, 4, 4), (1, 1, 1), activation='relu', name='conv-4'))

    # collapse the spatial dimensions into a (timesteps, features) sequence
    model.add(Reshape((30, 64), name='reshape'))

    # temporal modelling over the frame sequence
    model.add(CuDNNLSTM(64, return_sequences=True, name='lstm-1'))
    model.add(CuDNNLSTM(64, name='lstm-2'))

    # 6 emotion classes
    model.add(Dense(6, activation=tf.nn.softmax, name='result'))

I still plan on using a CNN-LSTM model, but I don't know how to structure my dataset now. I thought of labelling each frame in each .csv file with the corresponding emotion label and then combining all the .csv files into a single .csv file. This combined .csv file would then be passed to the above model, after changing the input shape and other necessary parameters, but I don't know whether the model would still be able to differentiate between the videos if it is done that way.
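For illustration, the combined file I have in mind would be built something like this, reusing the load_video_csv helper sketched earlier: each frame keeps its emotion label, plus a video identifier column so I can still tell which frames belong to which video (the directory and column names are placeholders):

    import glob

    import pandas as pd

    frames = []
    for path in glob.glob('features/*.csv'):        # placeholder directory
        df = pd.read_csv(path)
        _, label = load_video_csv(path)             # label parsing sketched above
        df['emotion'] = label                       # per-frame emotion label
        df['video_id'] = path                       # keeps frames of different videos separable
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv('combined.csv', index=False)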

So, to conclude, I need help with structuring my dataset and with how this dataset should be passed to a CNN-LSTM model.

Upvotes: 0

Views: 217

Answers (1)

Sachin Prasad H S

Reputation: 146

Looking at your problem statement, I don't think there is a need to differentiate between the videos.

You can go ahead with your approach of labeling each frame in the video and combining everything into a single CSV file.

You can use the code below to convert the CSV file into NumPy arrays and prepare your data for training.

import numpy as np
import pandas as pd

data = pd.read_csv('input.csv')

width, height = 48, 48

# each entry in the 'pixels' column is a space-separated string of pixel values
datapoints = data['pixels'].tolist()

#getting features for training
X = []
for xseq in datapoints:
    xx = [int(xp) for xp in xseq.split(' ')]
    xx = np.asarray(xx).reshape(width, height)
    X.append(xx.astype('float32'))

X = np.asarray(X)
X = np.expand_dims(X, -1)   # add a channel dimension: (samples, 48, 48, 1)

#getting one-hot labels for training
y = pd.get_dummies(data['emotion']).to_numpy()

#storing them using numpy
np.save('fdataX', X)
np.save('flabels', y)
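Note that np.save appends a .npy extension, so the saved arrays can be loaded back and split for training like this (a minimal sketch; using scikit-learn's train_test_split here is just one option, not something required by the approach above):

import numpy as np
from sklearn.model_selection import train_test_split

# np.save adds the .npy suffix automatically
X = np.load('fdataX.npy')
y = np.load('flabels.npy')

# hold out 20% of the samples for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)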

Upvotes: 0
