secsilm

Reputation: 430

What should I do if I want to use large datasets that don't fit into memory with TensorFlow?

I want to train a model with TensorFlow on a large dataset that cannot be loaded into memory all at once, but I don't know exactly what I should do.

I have read some great posts about the TFRecords file format, as well as the official documentation, but I still can't figure it out.

Is there a complete solution for this in TensorFlow?

Upvotes: 1

Views: 952

Answers (2)

Insectatorious

Reputation: 1335

Consider using tf.TextLineReader, which, in conjunction with tf.train.string_input_producer, allows you to load data from multiple files on disk (if your dataset is large enough that it needs to be spread across multiple files).

See https://www.tensorflow.org/programmers_guide/reading_data#reading_from_files

Code snippet from the link above:

filename_queue = tf.train.string_input_producer(["file0.csv", "file1.csv"])

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.stack([col1, col2, col3, col4])

with tf.Session() as sess:
  # Start populating the filename queue.
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  for i in range(1200):
    # Retrieve a single instance:
    example, label = sess.run([features, col5])

  coord.request_stop()
  coord.join(threads)
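If you want mini-batches rather than single rows, a batching op can be layered on top of the same pipeline. A minimal sketch using tf.train.shuffle_batch (the batch size and queue capacities below are illustrative, not part of the linked example):

# Group single decoded rows into shuffled mini-batches.
# batch_size, capacity and min_after_dequeue are illustrative values.
example_batch, label_batch = tf.train.shuffle_batch(
    [features, col5], batch_size=32, capacity=2000,
    min_after_dequeue=1000)

with tf.Session() as sess:
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  # Each run call now returns a whole batch instead of one row.
  batch_x, batch_y = sess.run([example_batch, label_batch])

  coord.request_stop()
  coord.join(threads)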

Upvotes: 2

Thomas Pinetz

Reputation: 7148

Normally you train batch-wise anyway, so you can load the data on the fly. For example, for images:

for bid in range(nr_batches):
    # Load only the current batch from disk.
    batch_x, batch_y = load_data_from_hd(bid)
    train_step.run(feed_dict={x: batch_x, y_: batch_y})

This way you load each batch on the fly and only keep in memory the data you need at any given moment. Naturally, training time increases when reading from the hard disk instead of memory.
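Here load_data_from_hd is a placeholder; a hypothetical implementation, assuming each batch was saved to disk beforehand as a pair of .npy files, might look like this:

import numpy as np

def load_data_from_hd(bid):
    # Hypothetical helper: assumes batches were pre-saved on disk as
    # batch_<bid>_x.npy / batch_<bid>_y.npy (the naming is illustrative).
    batch_x = np.load("batch_%d_x.npy" % bid)
    batch_y = np.load("batch_%d_y.npy" % bid)
    return batch_x, batch_y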

Upvotes: 1
