MindSeeker

Reputation: 600

TensorFlow Deep Learning Memory Leak?

I am doing GPU-accelerated deep learning with TensorFlow, and I am experiencing a memory leak (the RAM variety, not on the GPU).

I have narrowed it down, almost beyond all doubt, to the training line:

    self.sess.run(self.train_step, feed_dict={self.x: trainingdata, self.y_true: traininglabels, self.keepratio: self.training_keep_rate})

If I comment out that line, and only that line (while still doing all my pre-processing, validation, and testing for a few thousand training batches), the memory leak does not happen.

The leak is on the order of a few GB per hour. I am running Ubuntu with 16 GB of RAM and 16 GB of swap; the system becomes very laggy and unresponsive after 1-3 hours of running, once roughly a third to half of the RAM is used. That is a bit strange to me, since plenty of RAM is still free and the CPU is mostly idle when it happens.
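To quantify this, something along the lines of the sketch below is how one can watch resident memory grow per training step (sketch only: `batches` is a stand-in for my real input pipeline, the `self.*` names are the attributes from the line above, and `psutil` is a third-party package):

    # Sketch only: `batches` stands in for the real input pipeline; the
    # self.* names are the same attributes used in the training line above.
    import os
    import psutil  # third-party: pip install psutil

    process = psutil.Process(os.getpid())

    for step, (trainingdata, traininglabels) in enumerate(batches):
        self.sess.run(self.train_step,
                      feed_dict={self.x: trainingdata,
                                 self.y_true: traininglabels,
                                 self.keepratio: self.training_keep_rate})
        # Resident set size in MB; this stays flat when the run() call is
        # commented out and climbs steadily when it is enabled.
        rss_mb = process.memory_info().rss / (1024.0 * 1024.0)
        print("step %d: RSS %.1f MB" % (step, rss_mb))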

Here is some of the initializer code (run only once, at the beginning), in case it is relevant:

    with tf.name_scope('after_final_layer') as scope:
        self.layer1 = weights["wc1"]
        self.y_conv = network(self.x, weights, biases, self.keepratio)['out']
        variable_summaries(self.y_conv)
        # Note: don't add a softmax layer in the network if you are going to
        # use this cross-entropy function; it expects raw logits.
        # (In TF >= 1.0 this call requires keyword args: logits=..., labels=...)
        self.cross_entropy = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(self.y_conv, self.y_true,
                                                    name="softmax/cross_ent"),
            name="reduce_mean")
        self.train_step = tf.train.AdamOptimizer(
            learning_rate, name="Adam_Optimizer").minimize(self.cross_entropy)

        self.prediction = tf.argmax(self.y_conv, 1)
        self.correct_prediction = tf.equal(self.prediction,
                                           tf.argmax(self.y_true, 1))

        self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, tf.float32))

        if tensorboard:
            # Merge all the summaries and write them to the directory below
            self.merged = tf.summary.merge_all()
            self.my_writer = tf.summary.FileWriter(
                '/home/james/PycharmProjects/AI_Final/my_tensorboard',
                graph=self.sess.graph)

        # self.sess.run(tf.initialize_all_variables())  # deprecated predecessor of the line below
        tf.global_variables_initializer().run(session=self.sess)

I am also happy to post all of the network/initialization code, but I think it is probably irrelevant to this leak.

Am I doing something wrong, or have I found a TensorFlow bug? Thanks in advance!

Update: I will likely submit a bug report soon, but first I am trying to verify that I am not bothering the developers with my own mistakes. I have added

    self.sess.graph.finalize()

to the end of my initialization code. As I understand it, this should throw an exception if I accidentally add anything to the graph afterwards. No exceptions are thrown. I am using TF version 0.12.0-rc0, NumPy version 1.12.0b1, and Python version 2.7.6. Could those versions be outdated/the problem?
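For anyone unfamiliar with it, here is a minimal, self-contained illustration of what finalize() catches (the op names are made up for this example, and on 0.12 you would use tf.mul instead of tf.multiply):

    import tensorflow as tf

    sess = tf.Session()
    x = tf.placeholder(tf.float32, name="x")
    doubled = tf.multiply(x, 2.0, name="doubled")
    sess.graph.finalize()  # lock the graph once construction is done

    # Fine: runs an existing op, adds nothing to the graph.
    print(sess.run(doubled, feed_dict={x: 3.0}))

    # Would raise a RuntimeError, because each call tries to add a new op
    # to the now-finalized graph:
    # sess.run(tf.multiply(x, 2.0), feed_dict={x: 3.0})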

Upvotes: 2

Views: 1156

Answers (1)

MindSeeker

Reputation: 600

This issue is solved in 1.1. Ignore this page, which (at the time of writing) says that the latest stable version is r0.12; 1.1 is the latest stable version. See https://github.com/tensorflow/tensorflow/issues/9590 and https://github.com/tensorflow/tensorflow/issues/9872.
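As a quick sanity check after upgrading, you can print the installed version:

    import tensorflow as tf
    print(tf.__version__)  # anything below '1.1.0' still has the leak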

Upvotes: 1
