Lukeyb

Reputation: 857

Tensorflow Dataset API with HDFS

We have stored a list of *.tfrecord files in an HDFS directory. I'd like to use the new Dataset API, but the only documented example uses the old file queue and string_input_producer (https://www.tensorflow.org/deploy/hadoop). Among other things, these methods make it difficult to control epochs.

Is there any way to use HDFS with the Dataset API?

Upvotes: 6

Views: 8646

Answers (1)

mrry

Reputation: 126184

The HDFS file system layer works with both the old queue-based API and the new tf.data API. Assuming you have configured your system according to the TensorFlow/Hadoop deployment guide, you can create a dataset based on files in HDFS with the following code:

dataset = tf.data.TFRecordDataset(["hdfs://namenode:8020/path/to/file1.tfrecords",
                                   "hdfs://namenode:8020/path/to/file2.tfrecords"])
dataset = dataset.map(lambda record: tf.parse_single_example(record, ...))
# ...
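
Since the question mentions epochs: with tf.data, epoch, shuffling, and batching logic is expressed on the dataset itself rather than through queue runners. One possible continuation of the snippet above (the epoch count and batch size are illustrative):

num_epochs = 10   # illustrative value
batch_size = 32   # illustrative value

# Shuffle records, repeat for a fixed number of epochs, and form batches.
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.repeat(num_epochs)
dataset = dataset.batch(batch_size)

# One-shot iterator yields the next batch each time it is evaluated.
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()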

Note that since HDFS is a distributed file system, you might benefit from some of the suggestions in the "Parallelize data extraction" section of the Input Pipeline performance guide.
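For instance, here is a minimal sketch of reading several HDFS files concurrently, assuming a TF 1.x release where tf.contrib.data.parallel_interleave is available (the path glob and cycle_length are illustrative):

import tensorflow as tf

# List all TFRecord files under the HDFS directory (glob is illustrative).
files = tf.data.Dataset.list_files("hdfs://namenode:8020/path/to/*.tfrecords")

# Read up to 4 files concurrently and interleave their records, so a slow
# datanode does not stall the whole input pipeline.
dataset = files.apply(tf.contrib.data.parallel_interleave(
    tf.data.TFRecordDataset, cycle_length=4))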

Upvotes: 13
