Matias Haeussler

Reputation: 1131

How to translate the deprecated tf.train.QueueRunners TensorFlow approach for importing data to the new tf.data.Dataset approach

Although TensorFlow strongly recommends against using the deprecated functions that are being replaced by tf.data objects, there seems to be no good documentation for cleanly replacing the deprecated approach with the modern one. Moreover, the TensorFlow tutorials still use the deprecated functionality for file processing (Reading data tutorial: https://www.tensorflow.org/api_guides/python/reading_data).

On the other hand, although there is good documentation for the 'modern' approach (Importing data tutorial: https://www.tensorflow.org/guide/datasets), the old tutorials still exist and will probably lead many, as they did me, to use the deprecated approach first. That is why one would like a clean translation from the deprecated approach to the 'modern' one, and an example of this translation would probably be very useful.

#!/usr/bin/env python3
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import shutil
import os

if os.path.exists('example'):
    shutil.rmtree('example')  # clear images from a previous run
os.mkdir('example')

batch_sz = 10; epochs = 2; buffer_size = 30; samples = 0
# Generate 50 random 10x10 RGB images to read back through the pipeline.
for i in range(50):
    _x = np.random.randint(0, 256, (10, 10, 3), np.uint8)
    plt.imsave("example/image_{}.jpg".format(i), _x)
images = tf.train.match_filenames_once('example/*.jpg')
# Shuffled filename queue, limited to `epochs` passes over the files.
fname_q = tf.train.string_input_producer(images, epochs, True)
reader = tf.WholeFileReader()
_, value = reader.read(fname_q)
img = tf.image.decode_image(value)
img_batch = tf.train.batch([img], batch_sz, shapes=([10, 10, 3]))
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(),
        tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for _ in range(epochs):
        try:
            while not coord.should_stop():
                sess.run(img_batch)
                samples += batch_sz
                print(samples, "samples have been seen")
        except tf.errors.OutOfRangeError:
            print('Done training -- epoch limit reached')
        finally:
            coord.request_stop()
    coord.join(threads)

This code runs perfectly well for me, printing to the console:

10 samples have been seen
20 samples have been seen
30 samples have been seen
40 samples have been seen
50 samples have been seen
60 samples have been seen
70 samples have been seen
80 samples have been seen
90 samples have been seen
100 samples have been seen
110 samples have been seen
120 samples have been seen
130 samples have been seen
140 samples have been seen
150 samples have been seen
160 samples have been seen
170 samples have been seen
180 samples have been seen
190 samples have been seen
200 samples have been seen
Done training -- epoch limit reached

As can be seen, it uses deprecated functions and objects such as tf.train.string_input_producer() and tf.WholeFileReader(). An equivalent implementation using the 'modern' tf.data.Dataset approach is needed.

EDIT:

An example for importing CSV data has already been given: Replacing Queue-based input pipelines with tf.data. I would like this to be as complete as possible and assume that more examples are better, so I do not consider this a duplicate question.

Upvotes: 3

Views: 938

Answers (1)

Matias Haeussler

Reputation: 1131

Here is the translation, which prints exactly the same output.

#!/usr/bin/env python3
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
import shutil

if os.path.exists('example'):
    shutil.rmtree('example')  # clear images from a previous run
os.mkdir('example')

batch_sz = 10; epochs = 2; buffer_sz = 30; samples = 0
# Generate 50 random 10x10 RGB images to read back through the pipeline.
for i in range(50):
    _x = np.random.randint(0, 256, (10, 10, 3), np.uint8)
    plt.imsave("example/image_{}.jpg".format(i), _x)
# Dataset of shuffled filenames, repeated for `epochs` epochs.
fname_data = tf.data.Dataset.list_files('example/*.jpg') \
        .shuffle(buffer_sz).repeat(epochs)
# Decode each file into an image tensor and group the images in batches.
img_batch = fname_data.map(
        lambda fname: tf.image.decode_image(tf.read_file(fname), 3)) \
        .batch(batch_sz).make_initializable_iterator()

with tf.Session() as sess:
    sess.run([img_batch.initializer,
              tf.global_variables_initializer(),
              tf.local_variables_initializer()])
    next_element = img_batch.get_next()
    try:
        while True:
            sess.run(next_element)
            samples += batch_sz
            print(samples, "samples have been seen")
    except tf.errors.OutOfRangeError:
        # The iterator raises OutOfRangeError once all epochs are consumed.
        pass
    print('Done training -- epoch limit reached')

The main issues are:

  1. Use of tf.data.Dataset.list_files() to load the filenames as a dataset, instead of generating a queue with the deprecated tf.train.string_input_producer() for consuming filenames.
  2. Use of an iterator to process the dataset, which also requires initialization, instead of sequential reads from a deprecated tf.WholeFileReader, batched with the deprecated tf.train.batch() function (a simpler one-shot variant is sketched after this list).
  3. A Coordinator is no longer needed, because the queue threads (the tf.train.QueueRunners created by tf.train.string_input_producer()) are gone; instead, the end of the dataset iterator is detected by catching tf.errors.OutOfRangeError.
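
As a side note, when re-initialization is not needed, the initializer run can be dropped entirely with a one-shot iterator. The following is a minimal sketch under the same TF 1.x API and the same 'example' directory of JPEGs as above; it is an alternative illustration, not part of the original answer:

#!/usr/bin/env python3
import tensorflow as tf

# One-shot iterators need no explicit initializer, which shortens the
# session code when the pipeline is consumed only once.
dataset = tf.data.Dataset.list_files('example/*.jpg') \
        .map(lambda fname: tf.image.decode_image(tf.read_file(fname), 3)) \
        .batch(10)
next_element = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    try:
        while True:
            sess.run(next_element)
    except tf.errors.OutOfRangeError:
        print('Dataset exhausted')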

I hope this will be useful to many, as it was for me once I got it working.

BONUS: Dataset + Estimator

#!/usr/bin/env python3
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
import shutil

if os.path.exists('example'):
    shutil.rmtree('example')  # clear images from a previous run
os.mkdir('example')

batch_sz = 10; epochs = 2; buffer_sz = 10000; samples = 0
# Generate 50 random 10x10 RGB images to read back through the pipeline.
for i in range(50):
    _x = np.random.randint(0, 256, (10, 10, 3), np.uint8)
    plt.imsave("example/image_{}.jpg".format(i), _x)

# Model function for a custom estimator; it just passes the images through.
def model(features, labels, mode, params):
    return tf.estimator.EstimatorSpec(
            tf.estimator.ModeKeys.PREDICT, {'images': features})

estimator = tf.estimator.Estimator(model, 'model_dir', params={})

# Input function returning the dataset the estimator will consume.
def input_dataset():
    return tf.data.Dataset.list_files('example/*.jpg') \
            .shuffle(buffer_sz).repeat(epochs) \
            .map(lambda fname: tf.image.decode_image(tf.read_file(fname), 3)) \
            .batch(batch_sz)

predictions = estimator.predict(input_dataset,
        yield_single_examples=False)
for p_dict in predictions:
    samples += batch_sz
    print(samples, "samples have been seen")
print('Done training -- epoch limit reached')

The main issues are:

  1. Definition of a model function for a custom estimator to process the images, which in this case does nothing because we are just passing them through.
  2. Definition of an input_dataset function that returns the dataset to be used by the estimator (for prediction in this case).
  3. Use of tf.estimator.Estimator.predict() instead of a tf.Session() directly, plus yield_single_examples=False to retrieve batches of elements instead of single elements in the resulting prediction dictionaries.

This seems to me like more modular and reusable code.
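
To make that reusability concrete, here is a minimal sketch in which the pipeline from input_dataset is parameterized; the factory name make_input_fn and its arguments are illustrative, not part of the original code, and it assumes the estimator defined above:

# Hypothetical helper: the same pipeline, parameterized so one function
# can serve different estimator calls (predict here, train/eval likewise).
def make_input_fn(pattern, batch_sz, epochs, buffer_sz):
    def input_fn():
        return tf.data.Dataset.list_files(pattern) \
                .shuffle(buffer_sz).repeat(epochs) \
                .map(lambda fname: tf.image.decode_image(tf.read_file(fname), 3)) \
                .batch(batch_sz)
    return input_fn

predictions = estimator.predict(make_input_fn('example/*.jpg', 10, 2, 30),
        yield_single_examples=False)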

Upvotes: 2
