Reputation: 857
We have stored a list of *.tfrecord files in an HDFS directory. I'd like to use the new Dataset API, but the only example given is for the old file queue and string_input_producer (https://www.tensorflow.org/deploy/hadoop). Those methods make it difficult to manage epochs, among other things.
Is there any way to use HDFS with the Dataset API?
Upvotes: 6
Views: 8646
Reputation: 126184
The HDFS file system layer works with both the old queue-based API and the new tf.data
API. Assuming you have configured your system according to the TensorFlow/Hadoop deployment guide, you can create a dataset based on files in HDFS with the following code:
dataset = tf.data.TFRecordDataset(["hdfs://namenode:8020/path/to/file1.tfrecords",
"hdfs://namenode:8020/path/to/file2.tfrecords"])
dataset = dataset.map(lambda record: tf.parse_single_example(record, ...))
# ...
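Since the question mentions epochs: with tf.data you can express them directly with Dataset.repeat(). Here is a minimal end-to-end sketch (the feature names, shapes, and hyperparameters are placeholders, not taken from your data):

import tensorflow as tf

filenames = ["hdfs://namenode:8020/path/to/file1.tfrecords",
             "hdfs://namenode:8020/path/to/file2.tfrecords"]
num_epochs = 10

def parse(record):
    # Placeholder schema; substitute your own feature spec.
    features = tf.parse_single_example(record, {
        "image": tf.FixedLenFeature([], tf.string),
        "label": tf.FixedLenFeature([], tf.int64),
    })
    return features["image"], features["label"]

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parse)
dataset = dataset.shuffle(buffer_size=10000)  # shuffle within a bounded buffer
dataset = dataset.repeat(num_epochs)          # a fixed number of epochs
dataset = dataset.batch(32)

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
# get_next() raises tf.errors.OutOfRangeError once all epochs are consumed.

Running your training loop until OutOfRangeError gives you exact epoch boundaries, which is the part that was awkward with string_input_producer.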
Note that since HDFS is a distributed file system, you might benefit from some of the suggestions in the "Parallelize data extraction" section of the Input Pipeline performance guide.
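For example, recent releases include tf.contrib.data.parallel_interleave, which reads from several files concurrently. A sketch, assuming that op is available in your TensorFlow version and using an illustrative glob pattern:

files = tf.data.Dataset.list_files("hdfs://namenode:8020/path/to/*.tfrecords")
dataset = files.apply(tf.contrib.data.parallel_interleave(
    tf.data.TFRecordDataset,
    cycle_length=4))  # number of HDFS files read in parallel

This overlaps I/O across files, which helps hide the latency of remote reads from HDFS.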
Upvotes: 13