jojo

Reputation: 66

Training huge amounts of data with TensorFlow

I have about 60 thousand samples of size 200x870; they are all NumPy arrays, and I want to build a four-dimensional tensor out of them (with one singleton dimension) and train a CNN on them in TensorFlow. At that size the data no longer fits into memory all at once. Up to this point I was working with data that I could simply load and create batches from, as below:

with tf.Graph().as_default():
    data_train = tf.to_float(getInput.data_train)
    phase, lr = tf.placeholder(tf.bool), tf.placeholder(tf.float32)
    global_step = tf.Variable(0, trainable=False)
    image_train, label_train = tf.train.slice_input_producer([data_train, labels_train], num_epochs=args.num_epochs)
    images_train, batch_labels_train = tf.train.batch([image_train, label_train], batch_size=args.bsize)
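
For clarity, the layout I am aiming for is roughly the following (a tiny dummy batch just to show the shape, not my real data):

import numpy as np

# A handful of dummy samples of size 200x870 (the real set has about 60k of them)
samples = [np.zeros((200, 870), dtype=np.float32) for _ in range(4)]

# Stack them and add the singleton channel dimension -> (num_samples, 200, 870, 1)
data = np.stack(samples)[..., np.newaxis]
print(data.shape)  # (4, 200, 870, 1)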

Can someone suggest a way to work around the memory problem?

I wanted to split the dataset into subsets and, within one epoch, train on one subset after the other, using a Queue for the paths of these files:

import scipy.io as sc
import numpy as np
import threading
import time

import tensorflow as tf
from tensorflow.python.client import timeline

def testQueues():

    paths = ['data1', 'data2', 'data3', 'data4', 'data5']
    queue_capacity = 6
    bsize = 10
    num_epochs = 2

    # Queue holding the paths of the data files
    filename_queue = tf.FIFOQueue(
        capacity=queue_capacity,
        dtypes=tf.string,
        shapes=[[]]
    )
    filenames_placeholder = tf.placeholder(dtype='string', shape=(None))
    filenames_enqueue_op = filename_queue.enqueue_many(filenames_placeholder)
    data_train, phase = tf.placeholder(tf.float32), tf.placeholder(tf.bool)

    sess = tf.Session()
    # Enqueue all file paths once
    sess.run(filenames_enqueue_op, feed_dict={filenames_placeholder: paths})

    for i in range(len(paths)):
        # Dequeue one path, load that subset from disk and train one epoch on it
        train_set_batch_name = sess.run(filename_queue.dequeue())
        train_set_batch_name = train_set_batch_name.decode('utf-8')
        train_set_batch = np.load(train_set_batch_name + '.npy')
        train_set_batch = tf.cast(train_set_batch, tf.float32)
        init_op = tf.group(tf.initialize_all_variables(), tf.initialize_local_variables())
        sess.run(init_op)
        run_one_epoch(train_set_batch, sess)

        size = sess.run(filename_queue.size())
        print(size)
        print(train_set_batch)


def run_one_epoch(train_set, sess):
    image_train = tf.train.slice_input_producer([train_set], num_epochs=1)
    images_train = tf.train.batch(image_train, batch_size=10)
    x = tf.nn.relu(images_train)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run(x)
    except tf.errors.OutOfRangeError:
        pass
    finally:
        # When done, ask the threads to stop.
        coord.request_stop()
        coord.join(threads)


testQueues()

However, I get the following error:

FailedPreconditionError: Attempting to use uninitialized value input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs
     [[Node: input_producer/input_producer/fraction_of_32_full/limit_epochs/CountUpTo = CountUpTo[T=DT_INT64, _class=["loc:@input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs"], limit=1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs)]]

Also, it seems that I can only feed the dictionary with a NumPy array, not with a tf.Tensor, but casting the array to a tf.Tensor later is also troublesome.

Upvotes: 0

Views: 737

Answers (2)

anand_v.singh

Reputation: 2838

Have a look at the Dataset API: "The tf.data API enables you to build complex input pipelines from simple, reusable pieces."

With this approach you model your graph so that it handles the data for you, pulling in only a limited amount at a time for you to train your model on.

If the memory issue still persists, you might want to look into using a generator to create your tf.data.Dataset. As a next step, you could potentially speed up the process by preparing TFRecords to create your Dataset.
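
Just as a rough sketch (assuming your five .npy chunk files and the 200x870 sample shape; labels omitted for brevity), a generator-based pipeline could look something like this:

import numpy as np
import tensorflow as tf

paths = ['data1', 'data2', 'data3', 'data4', 'data5']   # chunk files from the question

def sample_generator():
    # Load one chunk at a time so only a small part of the data is ever in memory
    for p in paths:
        chunk = np.load(p + '.npy')                      # assumed shape (n, 200, 870)
        for sample in chunk:
            yield sample[..., np.newaxis]                # add the singleton channel dim

dataset = tf.data.Dataset.from_generator(
    sample_generator,
    output_types=tf.float32,
    output_shapes=(200, 870, 1))
dataset = dataset.batch(10)

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()                         # feed this tensor into your CNN

with tf.Session() as sess:
    print(sess.run(next_batch).shape)                    # (10, 200, 870, 1)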

Follow all the links to learn more and feel free to comment if you don't understand something.

Upvotes: 3

Yaroslav Bulatov

Reputation: 57893

For data that doesn't fit into memory, the standard solution is to use Queues. You can set up some ops that read from files directly (CSV files, image files) and feed them into TensorFlow -- https://www.tensorflow.org/versions/r0.11/how_tos/reading_data/index.html
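
As a rough sketch of that pattern only (the CSV file names and the toy four-features-plus-label row layout here are made up, and TensorFlow 1.x names are used):

import tensorflow as tf

# Queue of input file names; num_epochs creates a local variable that must be initialized
filename_queue = tf.train.string_input_producer(['data1.csv', 'data2.csv'], num_epochs=1)

# Read the files line by line and parse each line into features and a label
reader = tf.TextLineReader()
_, line = reader.read(filename_queue)
record_defaults = [[0.0], [0.0], [0.0], [0.0], [0]]
f1, f2, f3, f4, label = tf.decode_csv(line, record_defaults=record_defaults)
features = tf.stack([f1, f2, f3, f4])

feature_batch, label_batch = tf.train.batch([features, label], batch_size=10)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run([feature_batch, label_batch])   # a training step would go here
    except tf.errors.OutOfRangeError:
        pass
    finally:
        coord.request_stop()
        coord.join(threads)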

Upvotes: 1
