niko

Reputation: 1168

Tensorflow Dataset.from_tensor_slices taking too long

I have the following code:

import numpy as np
import tensorflow as tf

data = np.load("data.npy")
print(data)  # Makes sure the array gets loaded in memory
dataset = tf.contrib.data.Dataset.from_tensor_slices((data))

The file "data.npy" is 3.3 GB. Reading the file with numpy takes a couple of seconds but then the next line that creates the tensorflow dataset object takes ages to execute. Why is that? What is it doing under the hood?

Upvotes: 7

Views: 5973

Answers (2)

Ocxs

Reputation: 149

Try:

import numpy as np
import tensorflow as tf

data = np.load("data.npy")
a = tf.placeholder(tf.float32, shape=data.shape)  # placeholder so the array is fed at runtime
dataset = tf.data.Dataset.from_tensor_slices(a)
dataset = dataset.prefetch(buffer_size=1000)
dataset = dataset.batch(128)
iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()
with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={a: data})  # the array is fed here instead of being baked into the graph

When processing a large dataset, feeding it through a tf.placeholder like this works better, because the array is not embedded in the graph.
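
If it helps, here is a hedged sketch of how the iterator above could then be consumed; the training-step comment is only illustrative:

with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={a: data})
    while True:
        try:
            batch = sess.run(next_batch)  # one batch of 128 slices from the placeholder-fed dataset
            # ... run a training step on `batch` here (illustrative) ...
        except tf.errors.OutOfRangeError:
            break  # the dataset has been consumed once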

Upvotes: 0

Julio Daniel Reyes

Reputation: 6365

Quoting this answer:

np.load of a npz just returns a file loader, not the actual data. It's a 'lazy loader', loading the particular array only when accessed.

That is why it is fast.
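
For illustration, a minimal sketch of that lazy behaviour, using a hypothetical archive and array names:

import numpy as np

np.savez("example.npz", features=np.random.rand(1000, 10), labels=np.arange(1000))

archive = np.load("example.npz")  # returns an NpzFile "lazy loader"; nothing is read yet
print(type(archive))              # <class 'numpy.lib.npyio.NpzFile'>

features = archive["features"]    # the actual array is only read from disk on access
print(features.shape)             # (1000, 10)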

Edit 1: to expand on this answer a bit more, here is another quote from TensorFlow's documentation:

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().

This works well for a small dataset, but wastes memory (because the contents of the array will be copied multiple times) and can run into the 2GB limit for the tf.GraphDef protocol buffer.

The linked documentation also shows how to do this efficiently.
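
To make the "under the hood" part concrete, here is a hedged sketch (TF 1.x, with a small stand-in array; tf.contrib.data behaves the same way): from_tensor_slices converts the NumPy array into a constant inside the graph, so the GraphDef grows by roughly the size of the array, which is why a 3.3 GB array is slow to convert and can hit the 2GB protobuf limit.

import numpy as np
import tensorflow as tf

data = np.random.rand(1000, 100).astype(np.float32)  # stand-in for the real 3.3 GB array

# Passing the NumPy array directly embeds it in the graph as a constant.
dataset = tf.data.Dataset.from_tensor_slices(data)

graph_bytes = len(tf.get_default_graph().as_graph_def().SerializeToString())
print("GraphDef size in bytes:", graph_bytes)  # roughly the ~400 KB of the array itself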

Upvotes: 5
