Reputation: 418
So I want to use the Dataset API for batching my large dataset (~8 GB), because my GPU sits idle for long stretches while I pass data from Python to TensorFlow using feed_dict.
I followed the tutorial mentioned here:
When I run my simple code:
import numpy as np
import tensorflow as tf
one_hot_dataset = np.load("one_hot_dataset.npy")
dataset = tf.data.Dataset.from_tensor_slices(one_hot_dataset)
I get the following error message with TensorFlow 1.8 and Python 3.5:
Traceback (most recent call last):
File "<ipython-input-17-412a606c772f>", line 1, in <module>
dataset = tf.data.Dataset.from_tensor_slices((one_hot_dataset))
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 235, in from_tensor_slices
return TensorSliceDataset(tensors)
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1030, in __init__
for i, t in enumerate(nest.flatten(tensors))
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1030, in <listcomp>
for i, t in enumerate(nest.flatten(tensors))
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1014, in convert_to_tensor
as_ref=False)
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1104, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 235, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 214, in constant
value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 496, in make_tensor_proto
"Cannot create a tensor proto whose content is larger than 2GB.")
ValueError: Cannot create a tensor proto whose content is larger than 2GB.
How can I solve this? I think the cause is obvious, but what were the TensorFlow developers thinking when they limited the input data to 2 GB?!? I really cannot understand this rationale, and what is the workaround when dealing with larger datasets?
I googled quite a lot but could not find any similar error message. When I use a fifth of the numpy dataset, the steps above work without any issues.
I somehow need to tell TensorFlow that I will actually be loading the data batch by batch, and that I probably want to prefetch a few batches to keep my GPU busy. But it seems as if it is trying to load the whole numpy dataset at once. So what is the benefit of using the Dataset API? I can reproduce this error by simply trying to load my numpy dataset as a tf.constant into the TensorFlow graph, which obviously does not fit, and I get OOM errors.
Tips and troubleshooting hints appreciated!
Upvotes: 5
Views: 2839
Reputation: 3633
This issue is addressed in the tf.data user guide (https://www.tensorflow.org/guide/datasets), in the "Consuming NumPy arrays" section.
Basically, create an iterator with dataset.make_initializable_iterator() and feed your data at runtime, when the iterator is initialized.
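For example, here is a minimal sketch of that pattern for TF 1.x; the placeholder name, batch size, and prefetch depth are just illustrative, adjust them to your setup:
import numpy as np
import tensorflow as tf

one_hot_dataset = np.load("one_hot_dataset.npy")

# Placeholder with the same dtype/shape as the numpy array; the array is fed
# once at iterator initialization instead of being baked into the graph proto.
features_placeholder = tf.placeholder(one_hot_dataset.dtype, one_hot_dataset.shape)

dataset = tf.data.Dataset.from_tensor_slices(features_placeholder)
dataset = dataset.batch(128).prefetch(1)  # illustrative batch size / prefetch depth

iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer,
             feed_dict={features_placeholder: one_hot_dataset})
    while True:
        try:
            batch = sess.run(next_batch)
            # ... run your training step on `batch` here ...
        except tf.errors.OutOfRangeError:
            break
This way the array crosses into the runtime only once, at initialization, so you avoid both the 2 GB graph limit and the per-step feed_dict overhead.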
If this does not work for some reason, you can write your data to files, or create a dataset from a Python generator (https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator), where you can put arbitrary Python code, including slicing your numpy array and yielding the slices.
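A rough sketch of the generator route (the generator name and batch size are assumptions; the dtype is taken from your array):
import numpy as np
import tensorflow as tf

one_hot_dataset = np.load("one_hot_dataset.npy")

def slice_generator():
    # Arbitrary Python code is allowed here; this simply yields one row at a time.
    for row in one_hot_dataset:
        yield row

dataset = tf.data.Dataset.from_generator(
    slice_generator,
    output_types=tf.as_dtype(one_hot_dataset.dtype),
    output_shapes=tf.TensorShape(one_hot_dataset.shape[1:]))
dataset = dataset.batch(128).prefetch(1)  # illustrative batch size / prefetch depth

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()
The generator is pulled lazily, so the full array never has to be embedded in the graph.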
Upvotes: 4