Reputation: 1168
I have the following code:
data = np.load("data.npy")
print(data) # Makes sure the array gets loaded in memory
dataset = tf.contrib.data.Dataset.from_tensor_slices((data))
The file "data.npy"
is 3.3 GB. Reading the file with numpy takes a couple of seconds but then the next line that creates the tensorflow dataset object takes ages to execute. Why is that? What is it doing under the hood?
Upvotes: 7
Views: 5973
Reputation: 149
Try:
import numpy as np
import tensorflow as tf

data = np.load("data.npy")
# Define the pipeline in terms of a placeholder rather than the array itself.
a = tf.placeholder(tf.float32, shape=data.shape)
dataset = tf.data.Dataset.from_tensor_slices(a)
dataset = dataset.prefetch(buffer_size=1000)
dataset = dataset.batch(128)
iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()
with tf.Session() as sess:
    # The array is fed in once, when the iterator is initialized,
    # instead of being embedded in the graph.
    sess.run(iterator.initializer, feed_dict={a: data})
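    # A minimal sketch of then consuming the batches (the training step itself is omitted):
    while True:
        try:
            batch = sess.run(next_batch)
            # ... process `batch` (up to 128 rows) here ...
        except tf.errors.OutOfRangeError:
            break  # one full pass over the data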
When processing a large dataset, feeding it in through a tf.placeholder like this is better, because the array is not embedded in the graph itself.
Upvotes: 0
Reputation: 6365
Quoting this answer:
np.load of an npz just returns a file loader, not the actual data. It's a 'lazy loader', loading the particular array only when accessed.
That is why it is fast.
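To illustrate the quoted behaviour, here is a minimal sketch (the file names are just placeholders): np.load on an .npz archive returns an NpzFile object and only reads an array from disk when it is indexed, whereas a plain .npy file is read immediately.
import numpy as np

arr = np.arange(10)
np.save("example.npy", arr)        # plain array file
np.savez("example.npz", arr=arr)   # zipped archive of arrays

eager = np.load("example.npy")     # the whole array is read right away
lazy = np.load("example.npz")      # returns an NpzFile, a lazy loader
print(type(lazy))                  # <class 'numpy.lib.npyio.NpzFile'>
print(lazy["arr"])                 # the array is only loaded from disk here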
Edit 1: To expand on this answer a bit more, here is another quote, from TensorFlow's documentation:
If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices(). This works well for a small dataset, but wastes memory---because the contents of the array will be copied multiple times---and can run into the 2GB limit for the tf.GraphDef protocol buffer.
The link also shows how to do it efficiently.
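For reference, the approach that guide describes boils down to the same placeholder trick as in the other answer: define the dataset in terms of a tf.placeholder and feed the array once when initializing the iterator, so the 3.3 GB array is never serialized into the GraphDef. A minimal sketch, assuming the TF 1.x API:
import numpy as np
import tensorflow as tf

data = np.load("data.npy")

# The placeholder keeps the array out of the graph definition;
# from_tensor_slices on the raw array would embed it as a tf.constant.
features_placeholder = tf.placeholder(data.dtype, data.shape)
dataset = tf.data.Dataset.from_tensor_slices(features_placeholder)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    # The array crosses into the runtime once, via feed_dict, at init time.
    sess.run(iterator.initializer, feed_dict={features_placeholder: data})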
Upvotes: 5