Lukeyb

Reputation: 857

Tensorflow Dataset API with HDFS

We have stored a list of *.tfrecord files in an HDFS directory. I'd like to use the new Dataset API, but the only documented example uses the old file queue and string_input_producer (https://www.tensorflow.org/deploy/hadoop). Among other things, these methods make it difficult to control epochs.

Is there any way to use HDFS with the Dataset API?

Upvotes: 6

Views: 8646

Answers (1)

mrry

Reputation: 126184

The HDFS file system layer works with both the old queue-based API and the new tf.data API. Assuming you have configured your system according to the TensorFlow/Hadoop deployment guide, you can create a dataset based on files in HDFS with the following code:

dataset = tf.data.TFRecordDataset(["hdfs://namenode:8020/path/to/file1.tfrecords",
                                   "hdfs://namenode:8020/path/to/file2.tfrecords"])
dataset = dataset.map(lambda record: tf.parse_single_example(record, ...))
# ...
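
Since the question mentions epochs: with tf.data, epoch, shuffling, and batching logic is expressed on the dataset itself rather than through queue runners. One possible continuation of the snippet above (the epoch count and batch size are illustrative):

num_epochs = 10   # illustrative value
batch_size = 32   # illustrative value

# Shuffle records, repeat for a fixed number of epochs, and form batches.
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.repeat(num_epochs)
dataset = dataset.batch(batch_size)

# One-shot iterator yields the next batch each time it is evaluated.
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()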

Note that since HDFS is a distributed file system, you might benefit from some of the suggestions in the "Parallelize data extraction" section of the Input Pipeline performance guide.
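For instance, here is a minimal sketch of reading several HDFS files concurrently, assuming a TF 1.x release where tf.contrib.data.parallel_interleave is available (the path glob and cycle_length are illustrative):

import tensorflow as tf

# List all TFRecord files under the HDFS directory (glob is illustrative).
files = tf.data.Dataset.list_files("hdfs://namenode:8020/path/to/*.tfrecords")

# Read up to 4 files concurrently and interleave their records, so a slow
# datanode does not stall the whole input pipeline.
dataset = files.apply(tf.contrib.data.parallel_interleave(
    tf.data.TFRecordDataset, cycle_length=4))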

Upvotes: 13
