Reputation: 518
I have been struggling for a day to restore a model, without any success. My code consists of a class TF_MLPRegressor(), whose constructor defines the network architecture. I then invoke the fit() function to do the training. This is how I save a simple perceptron model with 1 hidden layer from within the fit() function:
starting_epoch = 0
# Launch the graph
tf.set_random_seed(self.random_state)  # fix the random seed before creating the Session so that it takes effect!
if hasattr(self, 'sess'):
    self.sess.close()
    del self.sess  # delete the Session to release memory
    gc.collect()
self.sess = tf.Session(config=self.config)  # keep the session around to predict from new data
# Create a saver object which will save all the variables
saver = tf.train.Saver(max_to_keep=2)  # max_to_keep=2 means keep no more than 2 checkpoint files
self.sess.run(tf.global_variables_initializer())
# ... (every 100 epochs)
saver.save(self.sess, self.checkpoint_dir + "/resume", global_step=epoch)
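For context, a minimal sketch of how this periodic save might sit inside the training loop; n_epochs and next_batch() are hypothetical names, while self.optimizer, self.cost, batch_x and batch_y are the ones used in the training call further down:

for epoch in range(starting_epoch, n_epochs):  # n_epochs: hypothetical epoch count
    batch_x, batch_y = self.next_batch()  # hypothetical mini-batch helper
    _, c = self.sess.run([self.optimizer, self.cost],
                         feed_dict={self.x: batch_x, self.y: batch_y})
    if epoch % 100 == 0:  # checkpoint every 100 epochs, as in the comment above
        saver.save(self.sess, self.checkpoint_dir + "/resume", global_step=epoch)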
Then I create a new TF_MLPRegressor() instance with exactly the same input parameter values and invoke the fit() function to restore the model like this:
self.sess = tf.Session(config=self.config)  # create a new session to load saved variables
ckpt = tf.train.latest_checkpoint(self.checkpoint_dir)
starting_epoch = int(ckpt.split('-')[-1])  # recover the epoch from the checkpoint name
metagraph = ".".join([ckpt, 'meta'])
saver = tf.train.import_meta_graph(metagraph)
self.sess.run(tf.global_variables_initializer())  # initialize variables
lhl = tf.trainable_variables()[2]  # weights of the last hidden layer
lhlA = lhl.eval(session=self.sess)  # values right after initialization
saver.restore(sess=self.sess, save_path=ckpt)  # restore model weights from the previously saved model
lhlB = lhl.eval(session=self.sess)  # values after restoring
print(lhlA == lhlB)
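Note that == on two NumPy arrays prints an element-wise boolean matrix; for a single yes/no answer the comparison could be written like this (numpy import assumed):

import numpy as np

print(np.array_equal(lhlA, lhlB))  # True iff every weight is identical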
lhlA and lhlB are the last hidden layer weights before and after restoring, and according to my code they match completely, i.e. the saved model is not loaded into the session. What am I doing wrong?
Upvotes: 2
Views: 931
Reputation: 518
I found a workaround! Strangely, the metagraph does not contain all the variables that I defined, or it assigns new names to them. For example, in the constructor I define the tensors that will carry the input feature vectors and the experimental values:
self.x = tf.placeholder("float", [None, feat_num], name='x')
self.y = tf.placeholder("float", [None], name='y')
However, when I do tf.reset_default_graph() and load the metagraph, I get the following list of variables:
[
<tf.Variable 'Variable:0' shape=(300, 300) dtype=float32_ref>,
<tf.Variable 'Variable_1:0' shape=(300,) dtype=float32_ref>,
<tf.Variable 'Variable_2:0' shape=(300, 1) dtype=float32_ref>,
<tf.Variable 'Variable_3:0' shape=(1,) dtype=float32_ref>
]
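For reference, a minimal sketch of the inspection that produces this list; the checkpoint directory is a hypothetical placeholder:

import tensorflow as tf

tf.reset_default_graph()
ckpt = tf.train.latest_checkpoint("checkpoints/")  # hypothetical directory
saver = tf.train.import_meta_graph(ckpt + ".meta")  # rebuild the saved graph structure
print(tf.trainable_variables())  # -> the four 'Variable*' entries above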
For the record, each input feature vector has 300 features. Anyway, when I later try to initiate training using:
_, c, p = self.sess.run([self.optimizer, self.cost, self.pred],
feed_dict={self.x: batch_x, self.y: batch_y, self.isTrain: True})
I get an error like:
"TypeError: Cannot interpret feed_dict key as Tensor: Tensor 'x' is not an element of this graph."
So, since every time I create an instance of class TF_MLPRegressor() I define the network architecture within the constructor anyway, I decided not to load the metagraph, and it worked! I don't know why TF doesn't save all variables into the metagraph; maybe it is because I define the network architecture explicitly (I don't use wrappers or default layers), like in the example below.
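A sketch of that kind of explicit definition; the variable names and the activation are assumptions, but the shapes match the list above (feat_num = 300, one hidden layer of 300 units):

# Weights/biases created without explicit names, which is why they
# show up as 'Variable:0', 'Variable_1:0', etc. in the metagraph
W1 = tf.Variable(tf.random_normal([feat_num, 300]))  # (300, 300)
b1 = tf.Variable(tf.random_normal([300]))  # (300,)
W2 = tf.Variable(tf.random_normal([300, 1]))  # (300, 1)
b2 = tf.Variable(tf.random_normal([1]))  # (1,)

hidden = tf.nn.relu(tf.add(tf.matmul(self.x, W1), b1))  # hidden layer
self.pred = tf.squeeze(tf.add(tf.matmul(hidden, W2), b2))  # shape (None,) to match self.y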
To sum up, I save my models as described in my first message, but to restore them I use the code below. Since the constructor has already rebuilt exactly the same graph, the Saver matches each saved variable by name and overwrites the freshly initialized values:
saver = tf.train.Saver(max_to_keep=2)
self.sess = tf.Session(config=self.config) # create a new session to load saved variables
self.sess.run(tf.global_variables_initializer())
ckpt = tf.train.latest_checkpoint(self.checkpoint_dir)
saver.restore(sess=self.sess, save_path=ckpt) # Restore model weights from previously saved model
Upvotes: 1