Reputation: 863
I am trying to create and train a network for identifying regions within an image. I've based my network on the deep learning MNIST tutorial; however, I have left out one of the fully-connected layers (for now).
My network (the dimensions reflect that the input images are 128 x 128 pixels):
# Input and target placeholders (128 x 128 = 16384 pixels, flattened)
x = tf.placeholder(tf.float32, [None, 16384])
yPrime = tf.placeholder(tf.float32, [None, 16384])
xImage = tf.reshape(x, [-1, 128, 128, 1])
# Convolutional Layer 1
wConv1 = weightVariable([5, 5, 1, 32])
bConv1 = biasVariable([32])
hConv1 = tf.nn.relu(conv2d(xImage, wConv1) + bConv1)
hPool1 = maxPool(hConv1, 2)
# Convolutional Layer 2
wConv2 = weightVariable([5, 5, 32, 64])
bConv2 = biasVariable([64])
hConv2 = tf.nn.relu(conv2d(hPool1, wConv2) + bConv2)
hPool2 = maxPool(hConv2, 2)
# Fully Connected Layer
wFc1 = weightVariable([32 * 32 * 64, 16384])
bFc1 = biasVariable([16384])
hPool2Flat = tf.reshape(hPool2, [-1, 32 * 32 * 64])
hFc1 = tf.nn.relu(tf.matmul(hPool2Flat, wFc1) + bFc1)
# Dropout layer
keepProb = tf.placeholder(tf.float32)
hFc1Drop = tf.nn.dropout(hFc1, keepProb)
# Readout layer (reuses wFc1 and bFc1 on hPool2Flat; hFc1Drop is not used here
# because the second fully-connected layer has been left out for now)
y = tf.nn.softmax(tf.matmul(hPool2Flat, wFc1) + bFc1)
# Training
crossEntropy = tf.reduce_mean(-tf.reduce_sum(yPrime * tf.log(y + 1e-10), reduction_indices=[1]))
trainStep = tf.train.AdamOptimizer(learningRate).minimize(crossEntropy)
# Evaluation
p = tf.placeholder(tf.float32, [None, 16384])
q = tf.placeholder(tf.float32, [None, 16384])
correctPrediction = tf.equal(p, q)
accuracy = tf.reduce_mean(tf.cast(correctPrediction, tf.float32))
saver = tf.train.Saver(tf.trainable_variables())
# Additional functions used:
def weightVariable(shape, name=None):
    initial = tf.truncated_normal(shape, stddev=0.1, name=name)
    return tf.Variable(initial)

def biasVariable(shape, name=None):
    initial = tf.constant(0.1, shape=shape, name=name)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def maxPool(x, poolSize):
    return tf.nn.max_pool(x, ksize=[1, poolSize, poolSize, 1],
                          strides=[1, poolSize, poolSize, 1], padding='SAME')
# Saves the current state of the network to disk
def saveNetwork(self, step=None):
    folder = os.path.dirname(self.saverPath)
    if not os.path.isdir(folder):
        os.mkdir(folder)
    if step is None:
        self.saver.save(self.sess, self.saverPath, write_meta_graph=False)
    else:
        self.saver.save(self.sess, self.saverPath, global_step=step,
                        write_meta_graph=False)
I am able to initialize and train the network just fine; I've run it for 10,000 iterations, monitored the progress, and verified that the output images are consistent with what I expect. My problem is that I am unable to save the model, either at the end of training or at checkpoints during training. When the call to save the graph is executed, the Python script hangs and, after a few minutes, exits with code 139, which, as I understand it, indicates running out of memory or trying to access memory that is unavailable.
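For context, here is a minimal sketch of the kind of loop that triggers the failure (getNextBatch and the network object are illustrative placeholder names, not names from my actual code):

# Illustrative only: getNextBatch and network are placeholder names.
for step in range(10000):
    batchX, batchY = getNextBatch(4)  # flattened 128 x 128 images and targets
    network.sess.run(trainStep, feed_dict={x: batchX, yPrime: batchY})
    if step % 1000 == 0:
        network.saveNetwork(step)  # hangs here, then exits with code 139

network.saveNetwork()  # saving after training fails the same way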
I have also created a single-layer network (based on the MNIST tutorial) which trains fine; I am able to save its graph at checkpoints and after training has completed.
I've done a rough calculation and the graph variables should use up about 4 GB of memory, although I am aware that TensorFlow consumes far more memory than that estimate. I'm running Ubuntu 16.04 and the PC has 64 GB of RAM. During training and saving, the process peaks at around 24 GB of memory (according to the resource monitor), which is still well below the amount available. I've also replicated this on Ubuntu 14.04.
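For reference, the parameter counts from the shapes above (assuming 32-bit floats) work out as follows:

conv1Params = 5 * 5 * 1 * 32 + 32           #          832
conv2Params = 5 * 5 * 32 * 64 + 64          #       51,264
fc1Params = 32 * 32 * 64 * 16384 + 16384    # 1,073,758,208
totalBytes = (conv1Params + conv2Params + fc1Params) * 4
print(totalBytes / 2.0 ** 30)               # ~4.0 GiB, nearly all of it in wFc1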
I've tried smaller batch sizes hoping to reduce the memory footprint, even down to just 4 images per step. It is still unable to save a checkpoint.
I'm still fairly new to TensorFlow, so I'm not sure where to look next and could use some advice. As you can see, I've set the saver to only save the trainable variables, hoping that would reduce the size of the file it's trying to save (doing so reduced the size of the graph file for my simple network from 4 GB to 2 GB). Is it that I'm trying to save too large a file to disk (hard drive space shouldn't be an issue; the drive it's saving to is a 2 TB hard disk)? Is Python unable to handle a file that large in memory when writing it to disk? Am I even on the right track in thinking this is a memory issue, given that Python exits with code 139?
Upvotes: 1
Views: 1775
Reputation: 126184
It looks like your process is crashing with a segmentation fault. The tf.train.Saver.save() method calls some C++ code that serializes all of your variables to a file. This serialization format has a 2 GB limit for the largest tensor, because it serializes each variable into a Protocol Buffer, which has a 2 GB maximum record size. Your weight variable wFc1 is 4 GB in size, so I suspect the failure is happening during that serialization step; the fact that it crashes rather than reporting an error is a bug.
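A quick back-of-the-envelope check (assuming 32-bit floats, and approximating the Protocol Buffer limit as 2^31 - 1 bytes) shows why this one variable is the problem:

wFc1Bytes = 32 * 32 * 64 * 16384 * 4   # 4,294,967,296 bytes, about 4 GiB
protoLimit = 2 ** 31 - 1               # roughly the 2 GB Protocol Buffer limit
print(wFc1Bytes > protoLimit)          # True: the tensor cannot be serialized in one piece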
One possible solution would be to shard your large variable. For example:
wFc1_0 = weightVariable([32 * 32 * 64, 4096])
wFc1_1 = weightVariable([32 * 32 * 64, 4096])
wFc1_2 = weightVariable([32 * 32 * 64, 4096])
wFc1_3 = weightVariable([32 * 32 * 64, 4096])
# ...
# Readout layer: each shard is (32 * 32 * 64) x 4096 floats, about 1 GB,
# which fits comfortably under the 2 GB serialization limit.
# (Note: in TensorFlow 1.0 and later the argument order is tf.concat(values, axis=1).)
y = tf.nn.softmax(
    tf.concat(1, [tf.matmul(hPool2Flat, wFc1_0),
                  tf.matmul(hPool2Flat, wFc1_1),
                  tf.matmul(hPool2Flat, wFc1_2),
                  tf.matmul(hPool2Flat, wFc1_3)]) + bFc1)
This might not be the most efficient sharding, so it pays to experiment here. Since you have a large number of classes, you might find some of TensorFlow's sampled loss functions—which support sharded weights—more efficient.
Upvotes: 4