Reputation: 6197
I thought that the TensorFlow saver would save all variables, as stated here:
If you do not pass any arguments to tf.train.Saver(), the saver handles all variables in the graph. Each variable is saved under the name that was passed when the variable was created.
https://www.tensorflow.org/programmers_guide/saved_model
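To make my expectation concrete, here is a minimal sketch (toy variable names, TF 1.x style, not my actual model) of what I understood the docs to mean: a Saver built with no arguments should round-trip every variable in the graph, even one that is never used in the loss.
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    counter = tf.get_variable('counter', initializer=0)              # not used in any loss
    weights = tf.get_variable('weights', initializer=tf.zeros([3]))
    demo_saver = tf.train.Saver()  # no arguments -> should handle both variables

with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.assign(counter, 5))
    demo_saver.save(sess, './demo.ckpt')

with tf.Session(graph=g) as sess:
    demo_saver.restore(sess, './demo.ckpt')
    print(sess.run(counter))  # expected: 5, not the initializer value 0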
However, the variable epochCount in my code below does not seem to get saved. This variable is used to keep track of the total number of epochs the model has trained over the data.
When I restore the graph, epochCount resets to its initializer value, not the value it had when the checkpoint was last saved.
It appears to me that it's only saving variables used in calculating the loss.
Here's my code.
This is where I declare my graph:
graph = tf.Graph()

with graph.as_default():
    # put inside graph to get new words each time
    valid_examples = np.array(random.sample(range(1, valid_window), valid_size))

    train_dataset = tf.placeholder(tf.int32, shape=[batch_size, cbow_window*2])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    valid_datasetSM = tf.constant(valid_examples, dtype=tf.int32)

    # to store epoch count so the total number of epochs is known
    epochCount = tf.get_variable('epochCount', initializer=0)

    embeddings = tf.get_variable('embeddings',
        initializer=tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.get_variable('softmax_weights',
        initializer=tf.truncated_normal([vocabulary_size, embedding_size],
                                        stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.get_variable('softmax_biases',
        initializer=tf.zeros([vocabulary_size]), trainable=False)

    embed = tf.nn.embedding_lookup(embeddings, train_dataset)  # train data set is
    embed_reshaped = tf.reshape(embed, [batch_size*cbow_window*2, embedding_size])
    segments = np.arange(batch_size).repeat(cbow_window*2)
    averaged_embeds = tf.segment_mean(embed_reshaped, segments, name=None)

    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                                   inputs=averaged_embeds, labels=train_labels,
                                   num_sampled=num_sampled, num_classes=vocabulary_size))

    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)  # original learning rate was 1.0

    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

    saver = tf.train.Saver()
If I restore the graph from a checkpoint, the embeddings and softmax_biases are restored, but epochCount is reset to its initializer value. (Note that I am not calling tf.global_variables_initializer().run(), which is a common cause of variables mistakenly being reset after a checkpoint has been restored.)
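Just to be explicit about the pitfall I am avoiding, this is the pattern that would cause such a reset (a sketch using the same restore call as in my code below, with the bad line left commented out):
with tf.Session(graph=graph) as session:
    saver.restore(session, './checkpointsBook2VecCbowWindow2Downloaded/bookVec.ckpt')
    # tf.global_variables_initializer().run()  # <- running this AFTER restore would overwrite
    #                                          #    the restored values; I never call it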
Here is the code where the graph is run:
num_steps = 1000001

with tf.Session(graph=graph) as session:
    saver.restore(session, './checkpointsBook2VecCbowWindow2Downloaded/bookVec.ckpt')
    average_loss = 0
    saveIteration = 1

    for step in range(1, num_steps):
        batch_data, batch_labels = generate_batch(batch_size, cbow_window)
        feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)

        if step % 20000 == 0:
            recEpoch_indexA = epoch_index - recEpoch_indexA
            epochCount = tf.add(epochCount, recEpoch_indexA, name=None)
            recEpoch_indexA = epoch_index

            save_path = saver.save(session, "checkpointsBook2VecCbowWindow2/bookVec.ckpt")

            chptName = 'B2VCbowW2Embed256ckpt' + str(saveIteration)
            zipfolder(chptName, 'checkpointsBook2VecCbowWindow2')
            uploadModel.SetContentFile(chptName + ".zip")
            uploadModel.Upload()
            print("Checkpoint uploaded to Google Drive")
            saveIteration += 1
This is the code I use after training to check what was saved: I restore the graph from the checkpoint and print out all the variables in its variables collection.
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('./MODEL/bookVec.ckpt.meta')
    saver.restore(sess, './MODEL/bookVec.ckpt')
    for v in tf.get_default_graph().get_collection("variables"):
        print('From variables collection ', v)
And this is the output from the code above
From variables collection <tf.Variable 'embeddings:0' shape=(10001, 256) dtype=float32_ref>
From variables collection <tf.Variable 'softmax_weights:0' shape=(10001, 256) dtype=float32_ref>
From variables collection <tf.Variable 'softmax_biases:0' shape=(10001,) dtype=float32_ref>
As seen, epochCount has not been saved.
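As an extra check, the checkpoint file itself can also be listed without rebuilding the graph (a sketch; I am assuming tf.train.list_variables is available in this TensorFlow version):
for name, shape in tf.train.list_variables('./MODEL/bookVec.ckpt'):
    print('In checkpoint file: ', name, shape)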
Upvotes: 1
Views: 2467
Reputation: 10475
The reason the variable is restored as 0 is that it is actually never updated (i.e. it is restored correctly)! You are overwriting the Python name epochCount with the result of the tf.add call during the session, which only creates a new add operation and returns its output tensor; nothing is ever written back to the variable. That is, the variable (in the TensorFlow sense) is "orphaned" and will stay at 0 forever.
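To see this concretely, here is a toy sketch (not your exact code) showing that running tf.add does not change what the variable holds, which is why the Saver keeps writing out 0:
import tensorflow as tf

counter = tf.get_variable('counter_demo', initializer=0)
added = tf.add(counter, 5)   # a new tensor; nothing is written back to the variable

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(added))    # 5
    print(sess.run(counter))  # still 0 -- and 0 is what gets saved to the checkpoint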
You could use tf.assign to update the variable instead. It could look something like this:
# where you define the graph
epochCount = tf.get_variable('epochCount', initializer=0)
update_epoch = tf.assign(epochCount, epochCount + 1)
...

# after you launch the session
for step in range(1, num_steps):
    if step % 20000 == 0:
        sess.run(update_epoch)
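Since you are adding a computed amount (recEpoch_indexA) rather than 1, you could also feed the increment through a placeholder and use tf.assign_add; this is just a sketch, and epoch_increment is a made-up name:
# in the graph
epoch_increment = tf.placeholder(tf.int32, shape=[])
update_epoch = tf.assign_add(epochCount, epoch_increment)

# in the session, inside your step % 20000 block, before saver.save(...)
session.run(update_epoch, feed_dict={epoch_increment: recEpoch_indexA})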
Upvotes: 1