Reputation: 450
I have thousands of videos, and each of them contains a constant number of frames: 35. I am trying to classify the videos by training an LSTM model, but I do not know exactly how people keep the sequential structure of a video while training an LSTM.
Here is what I want to do: the fit function in Keras is widely used, but I do not know how to keep the sequential structure of the videos while reading all the data into memory for fit.
rm.model.fit(X, y, batch_size=batch_size, validation_data=(X_test, y_test), verbose=1, epochs=100)
Could someone please explain to me how people train an LSTM model with videos (N frames each)?
I hope I have explained myself clearly.
Thanks in advance.
Upvotes: 1
Views: 1740
Reputation: 86600
From the documentation, we can see that the input shape expected by all Keras recurrent layers is:
(None, TimeSteps, DataDimension)
In Keras shapes, None is the number of examples you have.
So, in a first simple approach, you must have your training data shaped as:
(NumberOfVideos, NumberOfFrames, height * width * channels)
And your first layer (if the first layer is an LSTM) should use:
LSTM(AnyNumberOfCells, input_shape=(NumberOfFrames, height * width * channels))
The batch size (the number of examples) is never taken into account when you create the model; it only appears in your training data. That's why Keras shows None for that dimension in its messages.
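Putting that together, here is a minimal sketch of this first simple approach. The sizes (number of videos, frame dimensions, number of classes) are made up, and random arrays stand in for your real videos; substitute your own data:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    # Hypothetical sizes -- substitute your own.
    num_videos, num_frames = 1000, 35
    height, width, channels = 64, 64, 3
    num_classes = 5

    # Each video is a sequence of flattened frames:
    # (NumberOfVideos, NumberOfFrames, height * width * channels)
    X = np.random.rand(num_videos, num_frames, height * width * channels)
    y = np.random.randint(num_classes, size=num_videos)

    model = Sequential()
    model.add(LSTM(128, input_shape=(num_frames, height * width * channels)))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
    model.fit(X, y, batch_size=16, epochs=100)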
Now, this is a very simple and intuitive way to start, but there is actually no obligation to shape your training data like this; you can experiment with all kinds of layouts, as long as the data reaching the LSTM layers keeps the shape (BatchSize, TimeSteps, DataDimension). A nice way to do it (it seems to me) is to first apply some convolutions to reduce the data size before feeding it into an LSTM. The dimension height * width * channels is probably way too much to process all at once in an LSTM layer and will probably lead to memory problems.
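For example, you can wrap convolutional layers in TimeDistributed, so the same convolutions run on every frame independently and produce a much smaller feature vector per time step. This is only a sketch with hypothetical layer sizes:

    from keras.models import Sequential
    from keras.layers import (TimeDistributed, Conv2D, MaxPooling2D,
                              Flatten, LSTM, Dense)

    num_frames, height, width, channels = 35, 64, 64, 3  # hypothetical
    num_classes = 5

    model = Sequential()
    # The same Conv2D is applied to each of the 35 frames separately.
    model.add(TimeDistributed(Conv2D(16, (3, 3), activation='relu'),
                              input_shape=(num_frames, height, width, channels)))
    model.add(TimeDistributed(MaxPooling2D((2, 2))))
    # Flatten each frame's feature maps into one vector per time step.
    model.add(TimeDistributed(Flatten()))
    model.add(LSTM(128))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

Note that the input here keeps the frames unflattened, (NumberOfFrames, height, width, channels), so the convolutions can see the 2D structure of each frame.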
If you are having memory problems, you can study "generators" or Keras Sequences, which are used with the fit_generator() method. Keras will use the generator to read a limited amount of data at a time and train only on that data. Still, you will have to make these generators output batches in the same format: (ASmallerNumberOfVideos, NumberOfFrames, height * width * channels).
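A minimal generator sketch could look like this. Here load_video, train_paths, and train_labels are hypothetical names: load_video would be a helper you write that reads one file and returns an array of shape (NumberOfFrames, height * width * channels), and model is the one built above:

    import numpy as np

    def video_batch_generator(video_paths, labels, batch_size):
        # Loop forever, as fit_generator expects.
        while True:
            for start in range(0, len(video_paths), batch_size):
                batch_paths = video_paths[start:start + batch_size]
                # load_video is a hypothetical helper you would write;
                # it returns one video as (NumberOfFrames, h * w * channels).
                X = np.stack([load_video(p) for p in batch_paths])
                y = np.array(labels[start:start + batch_size])
                yield X, y

    model.fit_generator(video_batch_generator(train_paths, train_labels, 8),
                        steps_per_epoch=len(train_paths) // 8,
                        epochs=100)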
Now, if you are still having memory problems even like this, you will have to start using stateful=True layers.
In this case, the TimeSteps may be separated into different arrays. Your LSTM layer will not think "OK, this example is done" when you train; the next batch you feed will be treated as a continuation of the previous sequence.
Data will then be shaped like (NumberOfVideos, ReducedNumberOfFrames, h*w).
In this case, you will have to manually reset the state of the network with .reset_states() every time you finish a sequence, i.e., after feeding all the chunks of ReducedNumberOfFrames that make up one video.
You can combine the two ideas by also training with batches shaped as (ReducedNumberOfVideos, ReducedNumberOfFrames, h*w), as long as you keep good control of your training and call .reset_states() at the correct points.
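As a rough sketch of that combined stateful training loop, assume the 35 frames are split into 5 chunks of 7, the frames have already been reduced to h*w features, and iterate_video_batches() is a hypothetical loader you would write:

    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    batch_videos, chunk_frames, feat = 8, 7, 64 * 64  # hypothetical sizes
    chunks_per_video = 5  # 5 chunks * 7 frames = 35 frames
    num_classes = 5

    model = Sequential()
    # Stateful layers need a fixed batch size, given via batch_input_shape.
    model.add(LSTM(128, stateful=True,
                   batch_input_shape=(batch_videos, chunk_frames, feat)))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

    for epoch in range(100):
        # iterate_video_batches() is a hypothetical loader yielding
        # video_batch of shape (batch_videos, 35, feat) plus integer labels.
        for video_batch, labels in iterate_video_batches():
            for c in range(chunks_per_video):
                chunk = video_batch[:, c * chunk_frames:(c + 1) * chunk_frames]
                model.train_on_batch(chunk, labels)
            # These videos are finished: clear the state before the next batch.
            model.reset_states()

(Training on the labels at every chunk is a simplification; you might instead call train_on_batch only on the final chunk and use predict_on_batch on the earlier ones just to build up the state.)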
Upvotes: 4