Reputation: 430
I want to train a model with TensorFlow on a dataset that is too large to load into memory at once, but I don't know exactly what I should do.
I have read some great posts about the TFRecords file format and the official documentation, but I still can't figure it out.
Is there a complete solution plan with TensorFlow?
Upvotes: 1
Views: 952
Reputation: 1335
Consider using tf.TextLineReader, which in conjunction with tf.train.string_input_producer allows you to load data from multiple files on disk (if your dataset is large enough that it needs to be spread out over multiple files).
See https://www.tensorflow.org/programmers_guide/reading_data#reading_from_files
Code snippet from the link above:
filename_queue = tf.train.string_input_producer(["file0.csv", "file1.csv"])

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.stack([col1, col2, col3, col4])

with tf.Session() as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(1200):
        # Retrieve a single instance:
        example, label = sess.run([features, col5])

    coord.request_stop()
    coord.join(threads)
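For training you would usually want mini-batches rather than single examples. A minimal sketch of how you could stack examples into shuffled batches with tf.train.shuffle_batch on top of the pipeline above (the numeric arguments are illustrative placeholders you would tune, not recommended values):

# Hypothetical batching step: groups single decoded examples into
# shuffled mini-batches using a background queue.
example_batch, label_batch = tf.train.shuffle_batch(
    [features, col5],
    batch_size=128,          # examples per training step (placeholder)
    capacity=2000,           # maximum number of examples in the queue
    min_after_dequeue=1000)  # minimum left after dequeue, for shuffling

You would then fetch example_batch and label_batch in sess.run instead of the single-example tensors.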
Upvotes: 2
Reputation: 7148
Normally you train batch-wise anyway, so you can load the data on the fly. For example, for images:
for bid in range(nrBatches):
    batch_x, batch_y = load_data_from_hd(bid)
    train_step.run(feed_dict={x: batch_x, y_: batch_y})
So you load each batch on the fly and only keep in memory the data you need at any given moment. Naturally, training will take longer when data is read from the hard disk instead of memory.
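load_data_from_hd is left abstract above; here is a minimal sketch of one possible implementation, assuming (hypothetically) that each batch was saved to disk beforehand as a pair of NumPy .npy files:

import numpy as np

# Hypothetical helper: assumes batch `bid` was pre-saved as
# batch_<bid>_x.npy / batch_<bid>_y.npy, so only one batch is
# ever held in memory at a time.
def load_data_from_hd(bid):
    batch_x = np.load("batch_%d_x.npy" % bid)  # features for this batch
    batch_y = np.load("batch_%d_y.npy" % bid)  # labels for this batch
    return batch_x, batch_y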
Upvotes: 1