bfra

Reputation: 321

TensorFlow input pipeline: samples are read more than once

I'm trying to implement an input pipeline for my model that reads from TFRecord binary files; each binary file contains one example (image, label, and other data I need).

I have a text file with the list of file paths; then:

  1. I read the text file into a list, which I feed to string_input_producer() to generate a queue;
  2. I feed the queue to a TFRecordReader, which reads the serialized examples, and I decode the binary data;
  3. I use shuffle_batch() to arrange the examples into batches;
  4. I use the batches to evaluate my model.

The problem is that the same example can be read multiple times, while some examples may not be visited at all. I set the number of steps to the total number of images divided by the batch size, so I would expect that by the end of the last step every input example has been visited exactly once; instead, some are visited more than once and some never (seemingly at random). This makes my test evaluation totally unreliable.

If anybody has a hint about what I am doing wrong, please let me know.

A simplified version of my code for model testing is below. Thanks!

import tensorflow as tf


def my_input(file_list, batch_size):

    # build the list of TFRecord file paths from the text file
    filename = []
    with open(file_list, 'r') as f:
        for line in f:
            filename.append(params.TEST_RECORDS_DATA_DIR + line.strip())

    filename_queue = tf.train.string_input_producer(filename)

    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    features = tf.parse_single_example(
        serialized_example,
        features={
            'image_raw': tf.FixedLenFeature([], tf.string),
            'label_raw': tf.FixedLenFeature([], tf.string),
            'name': tf.FixedLenFeature([], tf.string)
            })

    image = tf.decode_raw(features['image_raw'], tf.uint8)
    image.set_shape(params.IMAGE_HEIGHT*params.IMAGE_WIDTH*3)
    image = tf.reshape(image, (params.IMAGE_HEIGHT,params.IMAGE_WIDTH,3))
    image = tf.cast(image, tf.float32)/255.0
    image = preprocess(image)

    label = tf.decode_raw(features['label_raw'], tf.uint8)
    label.set_shape(params.NUM_CLASSES)

    name = features['name']

    # group decoded examples into shuffled batches
    images, labels, image_names = tf.train.shuffle_batch([image, label, name],
            batch_size=batch_size, num_threads=2,
            capacity=1000 + 3 * batch_size, min_after_dequeue=1000)

    return images, labels, image_names


def main():

    with tf.Graph().as_default():

        # call input operations
        images, labels, image_names = my_input(file_list=params.TEST_FILE_LIST, batch_size=params.BATCH_SIZE)

        # load a trained model and make predictions     
        prediction = infer(images, labels, image_names)

        with tf.Session() as sess:

            # start the queue runners that fill the input queues
            tf.train.start_queue_runners(sess=sess)

            for step in range(params.N_STEPS):
                prediction_values = sess.run([prediction])
                # process output

    return

Upvotes: 0

Views: 232

Answers (1)

sygi

Reputation: 4647

My guess would be that tf.train.string_input_producer(filename) produces the filenames indefinitely by default, and since you batch the examples with multiple (2) threads, it may be that one thread has already started processing the files a second time while the other has not yet finished the first round. To read each example exactly once, use:

tf.train.string_input_producer(filename, num_epochs=1)

and initialize local variables at the start of the session:

sess.run(tf.initialize_local_variables())
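For completeness, here is a minimal sketch of how the single-epoch evaluation loop could look (assuming the TF 1.x queue-based API, with prediction and filename taken from the question's code). Instead of counting steps, the loop simply runs until the queue raises OutOfRangeError, which happens once every file has been read exactly once:

# inside my_input(): produce each filename exactly once
filename_queue = tf.train.string_input_producer(filename, num_epochs=1)

# inside main():
with tf.Session() as sess:
    # num_epochs is tracked with a local variable, so it must be initialized
    sess.run(tf.initialize_local_variables())

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            prediction_values = sess.run([prediction])
            # process output
    except tf.errors.OutOfRangeError:
        # raised after every example has been dequeued once
        pass
    finally:
        coord.request_stop()
        coord.join(threads)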

Upvotes: 0
