I'm trying to read my CSV file into Python, split the data into training and test sets (for n-fold cross-validation), and then feed it into my already-built deep learning architecture. I've read the TensorFlow tutorial on reading CSV files, which shows the following:
import tensorflow as tf

filename_queue = tf.train.string_input_producer(["file0.csv", "file1.csv"])

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.pack([col1, col2, col3, col4])

with tf.Session() as sess:
  # Start populating the filename queue.
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  for i in range(1200):
    # Retrieve a single instance:
    example, label = sess.run([features, col5])

  coord.request_stop()
  coord.join(threads)
Everything in this code makes sense to me except the for loop at the end.
Question 1: What is the significance of the 1200 in the for loop? Is it the number of records taken from the data?
The next part of the tutorial covers batching the examples, with the following code:
def read_my_file_format(filename_queue):
  reader = tf.SomeReader()
  key, record_string = reader.read(filename_queue)
  example, label = tf.some_decoder(record_string)
  processed_example = some_processing(example)
  return processed_example, label

def input_pipeline(filenames, batch_size, num_epochs=None):
  filename_queue = tf.train.string_input_producer(
      filenames, num_epochs=num_epochs, shuffle=True)
  example, label = read_my_file_format(filename_queue)
  # min_after_dequeue defines how big a buffer we will randomly sample
  # from -- bigger means better shuffling but slower start up and more
  # memory used.
  # capacity must be larger than min_after_dequeue and the amount larger
  # determines the maximum we will prefetch. Recommendation:
  # min_after_dequeue + (num_threads + a small safety margin) * batch_size
  min_after_dequeue = 10000
  capacity = min_after_dequeue + 3 * batch_size
  example_batch, label_batch = tf.train.shuffle_batch(
      [example, label], batch_size=batch_size, capacity=capacity,
      min_after_dequeue=min_after_dequeue)
  return example_batch, label_batch
I understand that this is asynchronous and that the code blocks until it has received everything. When I look at the values of example and label after the code runs, I see that each holds the information of only one specific record in the data.
Question 2: Is the code under read_my_file_format supposed to be the same as the first code block I posted? And is the input_pipeline function what batches the individual records together to a certain batch_size? If read_my_file_format is the same as the first code block, why isn't there the same for loop (which goes back to my first question)?
I'd appreciate any clarification since this is my first time using TensorFlow. Thanks for the help!
(1) 1200 is arbitrary - we should fix the example to use a named constant there to make it clearer. Thanks for spotting it. :) With the way the CSV reading example is set up, continued reads will read through the two CSV files as many times as desired (the string_input_producer holding the filenames does not have a num_epochs argument supplied, so it defaults to cycling forever). So 1200 is simply the number of records the programmer has chosen to retrieve in the example.
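For example, the loop from the first code block could name that count explicitly (NUM_RECORDS_TO_READ is a hypothetical name, not something from the tutorial):

NUM_RECORDS_TO_READ = 1200  # hypothetical name for the arbitrary count

for i in range(NUM_RECORDS_TO_READ):
  # Retrieve a single instance; the filename queue cycles forever,
  # so the loop simply stops after the chosen number of reads.
  example, label = sess.run([features, col5])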
If you want to read only the number of examples actually in the files, you can catch the OutOfRangeError that is thrown when the inputters run out of inputs, or read exactly the number of records present. There's a new read op in progress that should also make this easier, but I don't think it's included in 0.9.
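As a minimal sketch of the first approach against the 0.x queue API, reusing the graph from the first code block: passing num_epochs=1 to string_input_producer makes it raise OutOfRangeError after one full pass over the files, so the loop reads each record exactly once.

import tensorflow as tf

# Read every record exactly once.
filename_queue = tf.train.string_input_producer(
    ["file0.csv", "file1.csv"], num_epochs=1)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.pack([col1, col2, col3, col4])

with tf.Session() as sess:
  # num_epochs is tracked in a local variable, so initialize those first.
  sess.run(tf.initialize_local_variables())
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)
  try:
    while not coord.should_stop():
      example, label = sess.run([features, col5])
  except tf.errors.OutOfRangeError:
    pass  # the producer ran out of inputs: every record was read once
  finally:
    coord.request_stop()
  coord.join(threads)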
(2) It's supposed to set up a very similar set of ops, but not actually DO the reading. Remember that most of what you write in Python is constructing a graph, which is a sequence of ops that TensorFlow will execute. So the stuff in read_my_file_format is pretty much the stuff BEFORE the tf.Session() is created. In the top example, the code within the for loop is actually executing the TF graph to extract examples back into Python. But in the second part of the example, you're just setting up the plumbing to read the items into Tensors, and then adding additional ops that consume those tensors and do something useful - in this case, throwing them into a queue to create larger batches, which are themselves likely destined for later consumption by other TF ops.
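To make that concrete, here is a sketch of actually running the batched pipeline, assuming read_my_file_format has been filled in with a real reader and decoder; the filenames and batch_size=50 are assumptions for illustration. Nothing is read until sess.run is called:

example_batch, label_batch = input_pipeline(
    ["file0.csv", "file1.csv"], batch_size=50)

with tf.Session() as sess:
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)
  # Each run() call executes the graph and dequeues one whole batch
  # of 50 examples and their 50 labels.
  examples, labels = sess.run([example_batch, label_batch])
  coord.request_stop()
  coord.join(threads)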