FesianXu

Reputation: 447

How to load data in parallel in TensorFlow?

First, let me describe my application background:

There are about 500,000 videos saved as .avi files on my disk, and I want to use them as training samples. The easiest way would be to load them all into memory at once and then feed each batch into the model for training, but my memory is NOT big enough for that. So I need to load the video data batch by batch. The problem is that decoding a batch (say 64) of videos can take a lot of time, and if it is done serially, most of the time is spent on data loading instead of computation. I therefore want to load the batches in parallel, basically like the fit_generator API in Keras. I wonder if there is an existing way to do that in TensorFlow.

Thanks for any suggestions :)

PS: I previously implemented the idea with the threading package in Python; for more, see https://github.com/FesianXu/Parallel-DataLoader-in-TensorFlow

Of course, it is just toy code and too ad hoc. I want a more general solution, just like fit_generator in Keras.
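For reference, here is a minimal sketch (not my actual toy code) of the kind of batch generator I have in mind, assuming OpenCV for decoding; the directory, batch size, and clip shape are just placeholders:

    import glob
    import random

    import cv2
    import numpy as np

    def video_batch_generator(video_dir, batch_size=64, num_frames=16, size=(112, 112)):
        """Yield float32 batches of decoded clips, shape (batch, num_frames, H, W, 3)."""
        paths = glob.glob(video_dir + "/*.avi")
        while True:
            random.shuffle(paths)
            for start in range(0, len(paths) - batch_size + 1, batch_size):
                batch = []
                for path in paths[start:start + batch_size]:
                    cap = cv2.VideoCapture(path)
                    frames = []
                    while len(frames) < num_frames:
                        ok, frame = cap.read()
                        if not ok:
                            break
                        frames.append(cv2.resize(frame, size))
                    cap.release()
                    if len(frames) == num_frames:   # skip clips that are too short
                        batch.append(np.stack(frames))
                if batch:
                    yield np.stack(batch).astype(np.float32) / 255.0

The decoding here is still serial, which is exactly the part I want to run in parallel.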

Upvotes: 2

Views: 2048

Answers (2)

jspcal

Reputation: 51894

Take a look at tf.data.Dataset.from_generator:

Creates a Dataset whose elements are generated by generator.

The generator argument must be a callable object that returns an object that supports the iter() protocol (e.g. a generator function). The elements generated by generator must be compatible with the given output_types and (optional) output_shapes arguments.

This example shows how to easily parallelize the generator using tf.data.Dataset.map with the num_parallel_calls parameter: https://github.com/tensorflow/tensorflow/issues/14448#issuecomment-349240274

More info: https://www.tensorflow.org/guide/data_performance#parallelizing_data_extraction
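A minimal sketch of that pattern (not the exact code from the linked comment): the decoding stays in plain Python, and tf.data runs it in parallel via map(..., num_parallel_calls=...). It assumes OpenCV for decoding and a recent TF 2.x (tf.data.AUTOTUNE); the directory, clip shape, and batch size are placeholders:

    import glob

    import cv2
    import numpy as np
    import tensorflow as tf

    NUM_FRAMES, H, W = 16, 112, 112

    def decode_video(path):
        # Plain-Python decoder; `path` arrives as bytes from tf.numpy_function.
        cap = cv2.VideoCapture(path.decode("utf-8"))
        frames = []
        while len(frames) < NUM_FRAMES:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.resize(frame, (W, H)))
        cap.release()
        while len(frames) < NUM_FRAMES:          # pad clips that are too short
            frames.append(np.zeros((H, W, 3), np.uint8))
        return np.stack(frames).astype(np.float32) / 255.0

    def tf_decode(path):
        clip = tf.numpy_function(decode_video, [path], tf.float32)
        clip.set_shape((NUM_FRAMES, H, W, 3))
        return clip

    paths = glob.glob("/data/videos/*.avi")      # placeholder directory
    dataset = (tf.data.Dataset.from_tensor_slices(paths)
               .shuffle(len(paths))
               .map(tf_decode, num_parallel_calls=tf.data.AUTOTUNE)
               .batch(64)
               .prefetch(tf.data.AUTOTUNE))

The resulting dataset can be passed straight to model.fit or iterated in a custom training loop; prefetch overlaps decoding with training.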

Upvotes: 3

AKX

Reputation: 168834

TensorFlow has the tf.data Dataset API for this sort of thing.

See the tf.data tutorial and the API documentation.
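For example, an existing Python batch generator (like the one sketched in the question) can be wrapped directly. A minimal sketch, assuming a recent TF 2.x (output_signature, tf.data.AUTOTUNE) and the hypothetical video_batch_generator from the question:

    import tensorflow as tf

    # `video_batch_generator` is the hypothetical batch generator sketched in
    # the question; it is assumed to yield float32 arrays of shape
    # (batch, 16, 112, 112, 3).
    dataset = tf.data.Dataset.from_generator(
        lambda: video_batch_generator("/data/videos", batch_size=64),
        output_signature=tf.TensorSpec(shape=(None, 16, 112, 112, 3), dtype=tf.float32),
    ).prefetch(tf.data.AUTOTUNE)

    # model.fit(dataset, ...) then consumes batches much like Keras' fit_generator,
    # with loading overlapped with training thanks to prefetch.

Note that from_generator still runs the generator in a single Python thread, so for truly parallel decoding the map(num_parallel_calls=...) approach in the other answer is the better fit.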

Upvotes: 1
