cruvadom

Reputation: 354

Caching a dataset with examples of varied length

My dataset consists of audio segments between 5 and 180 seconds long. The number of examples is small enough to cache in memory instead of reading from disk over and over. Storing the data in a constant tensor / variable and using tf.train.slice_input_producer would let me cache the dataset in memory, but it requires storing all the data in one matrix. Since some examples are much longer than others, this matrix might be unnecessarily large, and perhaps too large for RAM.
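To make the padding cost concrete, here is a minimal sketch of that constant-tensor / slice_input_producer setup (TF 1.x style); the sizes, lengths, and variable names are made up for illustration and are not from my real data:

import numpy as np
import tensorflow as tf

# Toy stand-in for the real audio: 100 segments of varying length, all padded
# into one (100, max_len) matrix -- the padding is what can blow up RAM.
# (Real audio would be seconds * sample_rate samples per row.)
rng = np.random.RandomState(0)
lengths = rng.randint(5, 181, size=100)       # stand-in for 5-180 s segments
max_len = int(lengths.max())
data = np.zeros((100, max_len), dtype=np.float32)
for i, n in enumerate(lengths):
    data[i, :n] = rng.randn(n)

data_t = tf.constant(data)                    # the whole padded matrix lives in memory
lengths_t = tf.constant(lengths, dtype=tf.int32)

# slice_input_producer yields one shuffled (segment, length) pair per call;
# batching and preprocessing ops would follow.
segment, length = tf.train.slice_input_producer([data_t, lengths_t], shuffle=True)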

I could simply keep my data as a list of numpy arrays and do the whole input reading-randomizing-preprocessing in a non-TensorFlow way with a feed_dict, but I wonder if there is a way to do it without completely giving up on TensorFlow for the input reading-randomizing-preprocessing part.
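For comparison, a minimal sketch of that plain feed_dict alternative (the placeholder name and the preprocessing op below are just placeholders, not my actual pipeline):

import numpy as np
import tensorflow as tf

# Keep the segments as a plain Python list of 1-D numpy arrays, no padding.
rng = np.random.RandomState(0)
segments = [rng.randn(n).astype(np.float32) for n in rng.randint(5, 181, size=100)]

# The graph only ever sees one variable-length segment at a time.
audio = tf.placeholder(tf.float32, shape=[None], name="audio")
features = tf.abs(audio)                      # stand-in for the real preprocessing

with tf.Session() as sess:
    for _ in range(10):
        idx = rng.randint(len(segments))      # randomization done outside TensorFlow
        out = sess.run(features, feed_dict={audio: segments[idx]})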

Thanks!

Upvotes: 0

Views: 1498

Answers (1)

Olivier Moindrot

Reputation: 28198

The more recent tf.data library provides a tf.data.Dataset.cache method to cache an entire dataset into memory or into a file.

For instance:

dataset = ...
dataset = dataset.map(preprocessing_fn)  # apply preprocessing
dataset = dataset.cache()  # cache entire dataset in memory after preprocessing

I've provided more details on how to use cache() in this answer.
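As a rough sketch of how this could look for variable-length audio (assuming the preprocessed segments fit in memory; the generator, preprocessing function, and batch size below are placeholders, not a prescribed recipe):

import numpy as np
import tensorflow as tf

# Variable-length segments kept as a list of numpy arrays, no global padding.
rng = np.random.RandomState(0)
segments = [rng.randn(n).astype(np.float32) for n in rng.randint(5, 181, size=100)]

def gen():
    for s in segments:
        yield s

dataset = tf.data.Dataset.from_generator(
    gen, output_types=tf.float32, output_shapes=tf.TensorShape([None]))
dataset = dataset.map(lambda x: tf.abs(x))        # placeholder preprocessing
dataset = dataset.cache()                         # keep preprocessed segments in memory
dataset = dataset.shuffle(buffer_size=100).repeat()
dataset = dataset.padded_batch(8, padded_shapes=[None])  # pad only within each batch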

Upvotes: 2
