S.Perera

TensorFlow Checkpoint variables not saved

I am trying to use Checkpoint for my model. Before that, I tried it with a toy example. It runs with no errors, but every time I run it, the training parameter seems to start again from its initial value. Am I missing something here? Following is the code I'm using:

import numpy as np
import tensorflow as tf

X = tf.range(10.)
Y = 50.*X
    
class CGMM(object):
    def __init__(self):
        self.beta = tf.Variable(1., dtype=np.float32)

    @tf.function
    def objfun(self):
        beta = self.beta
        obj = tf.reduce_mean(tf.square(beta*self.X - self.Y))
        return obj

    def build_model(self,X,Y):
        self.X,self.Y=X,Y
        optimizer = tf.keras.optimizers.RMSprop(0.5)
        ckpt = tf.train.Checkpoint(step=tf.Variable(1), model=self.objfun, optimizer=optimizer)
        manager = tf.train.CheckpointManager(ckpt, './tf_ckpts_cg', max_to_keep=3)

        ckpt.restore(manager.latest_checkpoint)
        if manager.latest_checkpoint:
            print("Restored from {}".format(manager.latest_checkpoint))
        else:
            print("Initializing from scratch.")

        for i in range(20):
            optimizer.minimize(self.objfun, var_list=self.beta)
            loss, beta = self.objfun(), self.beta
            # print(self.beta.numpy())
            ckpt.step.assign_add(1)
            if int(ckpt.step) % 5 == 0:
              save_path = manager.save()
              print("Saved checkpoint for step {}: {}".format(int(ckpt.step), save_path))
              print("loss {:1.2f}".format(loss.numpy()))
              print("beta {:1.2f}".format(beta.numpy()))

        return beta


model = CGMM()
opt_beta = model.build_model(X, Y)

Results 1st run:

Initializing from scratch.
Saved checkpoint for step 5: ./tf_ckpts_cg/ckpt-1
loss 56509.74
beta 5.47
Saved checkpoint for step 10: ./tf_ckpts_cg/ckpt-2
loss 48354.54
beta 8.81
Saved checkpoint for step 15: ./tf_ckpts_cg/ckpt-3
loss 42085.54
beta 11.57
Saved checkpoint for step 20: ./tf_ckpts_cg/ckpt-4
loss 36750.57
beta 14.09

Results 2nd run:

Restored from ./tf_ckpts_cg/ckpt-4
Saved checkpoint for step 25: ./tf_ckpts_cg/ckpt-5
loss 54619.16
beta 6.22
Saved checkpoint for step 30: ./tf_ckpts_cg/ckpt-6
loss 46997.79
beta 9.39
Saved checkpoint for step 35: ./tf_ckpts_cg/ckpt-7
loss 40958.30
beta 12.09
Saved checkpoint for step 40: ./tf_ckpts_cg/ckpt-8
loss 35763.21
beta 14.58

Upvotes: 1

Answers (1)

Lescurel

What's the issue:

The main issue is that your beta variable is not trackable: this means the checkpoint object will not save it. We can see that by inspecting the content of the checkpoint with the following function:

>>> tf.train.list_variables(tf.train.latest_checkpoint('./tf_ckpts_cg/'))

[('_CHECKPOINTABLE_OBJECT_GRAPH', []),
 ('optimizer/decay/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/iter/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/learning_rate/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/momentum/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/rho/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('save_counter/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('step/.ATTRIBUTES/VARIABLE_VALUE', [])]

The only tf.Variables tracked by the checkpoint are the ones from the optimizer and those used by the tf.train.Checkpoint object itself. Passing model=self.objfun hands the checkpoint a bound method wrapped in tf.function, which does not expose beta as a trackable dependency.


A possible solution:

To change that, you need to make your variable trackable. The TensorFlow documentation on this subject is not great, but after searching a bit, you can find the following in the tf.Variable documentation:

Variables are automatically tracked when assigned to attributes of types inheriting from tf.Module.

[...]

This tracking then allows saving variable values to training checkpoints, or to SavedModels which include serialized TensorFlow graphs.

So, by making your CGMM class inherit from tf.Module, you can track your beta variable and restore it! Here's a really straightforward change to your code:

class CGMM(tf.Module):
    def __init__(self):
        super(CGMM, self).__init__(name='CGMM')
        self.beta = tf.Variable(1., dtype=np.float32)
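
As a quick sanity check (not strictly necessary), tf.Module exposes the variables it tracks through its variables and trainable_variables properties, so you can confirm that beta is now picked up. This should print something like:

>>> model = CGMM()
>>> model.trainable_variables
(<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,)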

We also need to tell the Checkpoint object that the model is now the CGMM object:

ckpt = tf.train.Checkpoint(step=tf.Variable(1), model=self, optimizer=optimizer)
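
Putting it all together, here is a minimal consolidated version of the corrected script. This is just a sketch combining the two changes above with the rest of your training loop unchanged:

import numpy as np
import tensorflow as tf

X = tf.range(10.)
Y = 50. * X

class CGMM(tf.Module):
    def __init__(self):
        super(CGMM, self).__init__(name='CGMM')
        self.beta = tf.Variable(1., dtype=np.float32)

    @tf.function
    def objfun(self):
        # Mean squared error between beta*X and Y
        return tf.reduce_mean(tf.square(self.beta * self.X - self.Y))

    def build_model(self, X, Y):
        self.X, self.Y = X, Y
        optimizer = tf.keras.optimizers.RMSprop(0.5)
        # model=self: the CGMM instance is a tf.Module now, so beta is tracked
        ckpt = tf.train.Checkpoint(step=tf.Variable(1), model=self, optimizer=optimizer)
        manager = tf.train.CheckpointManager(ckpt, './tf_ckpts_cg', max_to_keep=3)

        ckpt.restore(manager.latest_checkpoint)
        if manager.latest_checkpoint:
            print("Restored from {}".format(manager.latest_checkpoint))
        else:
            print("Initializing from scratch.")

        for i in range(20):
            optimizer.minimize(self.objfun, var_list=[self.beta])
            ckpt.step.assign_add(1)
            if int(ckpt.step) % 5 == 0:
                save_path = manager.save()
                print("Saved checkpoint for step {}: {}".format(int(ckpt.step), save_path))
                print("loss {:1.2f}".format(self.objfun().numpy()))
                print("beta {:1.2f}".format(self.beta.numpy()))

        return self.beta

model = CGMM()
opt_beta = model.build_model(X, Y)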

Now if we train for a few steps and look at the content of the checkpoint file we get something promising. The beta variable is now saved:

>>> tf.train.list_variables(tf.train.latest_checkpoint('./tf_ckpts_cg/'))

[('_CHECKPOINTABLE_OBJECT_GRAPH', []),
 ('model/beta/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('model/beta/.OPTIMIZER_SLOT/optimizer/rms/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/decay/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/iter/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/learning_rate/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/momentum/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/rho/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('save_counter/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('step/.ATTRIBUTES/VARIABLE_VALUE', [])]

And if we run the program a few times, we get:

>>> run tf-ckpt.py

Restored from ./tf_ckpts_cg/ckpt-28
Saved checkpoint for step 145: ./tf_ckpts_cg/ckpt-29
loss 0.00
beta 49.99
Saved checkpoint for step 150: ./tf_ckpts_cg/ckpt-30
loss 0.00
beta 50.00

Hurray!


Note: In order to track variables, you can also use any kind of keras.layers.Layer as well as any keras.Model. This is probably the easiest way.

An extract from the Training Checkpoint guide:

Subclasses of tf.train.Checkpoint, tf.keras.layers.Layer, and tf.keras.Model automatically track variables assigned to their attributes.
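
As a sketch of that alternative (assuming the rest of the code stays the same), the model as a tf.keras.Model subclass would look like this:

class CGMM(tf.keras.Model):
    def __init__(self):
        super(CGMM, self).__init__(name='CGMM')
        # Variables assigned as attributes of a keras.Model are tracked automatically
        self.beta = tf.Variable(1., dtype=np.float32)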

Upvotes: 2
