Reputation: 894
I am trying to use Checkpoint for my model. Before that, I tried it with a toy example. It runs with no errors, but every time I run it, the training parameter seems to start from its initial value again. Am I missing something here? Following is the code I'm using:
import numpy as np
import tensorflow as tf

X = tf.range(10.)
Y = 50.*X

class CGMM(object):
    def __init__(self):
        self.beta = tf.Variable(1., dtype=np.float32)

    @tf.function
    def objfun(self):
        beta = self.beta
        obj = tf.reduce_mean(tf.square(beta*self.X - self.Y))
        return obj

    def build_model(self, X, Y):
        self.X, self.Y = X, Y
        optimizer = tf.keras.optimizers.RMSprop(0.5)
        ckpt = tf.train.Checkpoint(step=tf.Variable(1), model=self.objfun, optimizer=optimizer)
        manager = tf.train.CheckpointManager(ckpt, './tf_ckpts_cg', max_to_keep=3)
        ckpt.restore(manager.latest_checkpoint)
        if manager.latest_checkpoint:
            print("Restored from {}".format(manager.latest_checkpoint))
        else:
            print("Initializing from scratch.")
        for i in range(20):
            optimizer.minimize(self.objfun, var_list=self.beta)
            loss, beta = self.objfun(), self.beta
            # print(self.beta.numpy())
            ckpt.step.assign_add(1)
            if int(ckpt.step) % 5 == 0:
                save_path = manager.save()
                print("Saved checkpoint for step {}: {}".format(int(ckpt.step), save_path))
                print("loss {:1.2f}".format(loss.numpy()))
                print("beta {:1.2f}".format(beta.numpy()))
        return beta

model = CGMM()
opt_beta = model.build_model(X, Y)
Results of the 1st run:
Initializing from scratch.
Saved checkpoint for step 5: ./tf_ckpts_cg/ckpt-1
loss 56509.74
beta 5.47
Saved checkpoint for step 10: ./tf_ckpts_cg/ckpt-2
loss 48354.54
beta 8.81
Saved checkpoint for step 15: ./tf_ckpts_cg/ckpt-3
loss 42085.54
beta 11.57
Saved checkpoint for step 20: ./tf_ckpts_cg/ckpt-4
loss 36750.57
beta 14.09
Results of the 2nd run:
Restored from ./tf_ckpts_cg/ckpt-4
Saved checkpoint for step 25: ./tf_ckpts_cg/ckpt-5
loss 54619.16
beta 6.22
Saved checkpoint for step 30: ./tf_ckpts_cg/ckpt-6
loss 46997.79
beta 9.39
Saved checkpoint for step 35: ./tf_ckpts_cg/ckpt-7
loss 40958.30
beta 12.09
Saved checkpoint for step 40: ./tf_ckpts_cg/ckpt-8
loss 35763.21
beta 14.58
Upvotes: 1
Views: 696
Reputation: 11631
What's the issue:
The main issue is that your beta variable is not trackable, which means that the checkpoint object will not save it. We can see that by inspecting the content of the checkpoint with the following function:
>>> tf.train.list_variables(tf.train.latest_checkpoint('./tf_ckpts_cg/'))
[('_CHECKPOINTABLE_OBJECT_GRAPH', []),
('optimizer/decay/.ATTRIBUTES/VARIABLE_VALUE', []),
('optimizer/iter/.ATTRIBUTES/VARIABLE_VALUE', []),
('optimizer/learning_rate/.ATTRIBUTES/VARIABLE_VALUE', []),
('optimizer/momentum/.ATTRIBUTES/VARIABLE_VALUE', []),
('optimizer/rho/.ATTRIBUTES/VARIABLE_VALUE', []),
('save_counter/.ATTRIBUTES/VARIABLE_VALUE', []),
('step/.ATTRIBUTES/VARIABLE_VALUE', [])]
The only tf.Variables tracked by the checkpoint are the ones from the optimizer and those used by the tf.train.Checkpoint object itself.
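If you want to double-check the stored values themselves, here is a small sketch using tf.train.load_checkpoint, which gives a reader over the raw tensors (the key names are the ones from the listing above; this snippet is added for illustration and is not part of the original question):

reader = tf.train.load_checkpoint(tf.train.latest_checkpoint('./tf_ckpts_cg/'))
# The step counter is there...
print(reader.get_tensor('step/.ATTRIBUTES/VARIABLE_VALUE'))
# ...but there is no 'model/beta/...' key to read: beta was never saved.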
A possible solution:
To change that, you need to track your variable. The TensorFlow documentation on this subject is not great, but after searching a bit, you can read the following in the tf.Variable documentation:
Variables are automatically tracked when assigned to attributes of types inheriting from tf.Module.
[...]
This tracking then allows saving variable values to training checkpoints, or to SavedModels which include serialized TensorFlow graphs.
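As a quick, self-contained illustration of that statement (the Toy class below is made up for this answer, it is not from the question):

class Toy(tf.Module):
    def __init__(self):
        super().__init__(name='toy')
        self.w = tf.Variable(3.0)  # assigned to a tf.Module attribute, so it is tracked

print(Toy().trainable_variables)  # prints a tuple containing the w variable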
So, by making your CGMM class inherit from tf.Module, you can track your beta variable and restore it! Here's a really straightforward change to your code:
class CGMM(tf.Module):
    def __init__(self):
        super(CGMM, self).__init__(name='CGMM')
        self.beta = tf.Variable(1., dtype=np.float32)
We also need to tell the Checkpoint object that the model is now the CGMM object:
ckpt = tf.train.Checkpoint(step=tf.Variable(1), model=self, optimizer=optimizer)
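For completeness, here is a sketch of the whole toy example with those two changes applied. Everything else is kept as in the question (except that I pass var_list as a list, which is what the Keras optimizers expect), so treat it as a guide rather than a verified drop-in script:

import numpy as np
import tensorflow as tf

X = tf.range(10.)
Y = 50.*X

class CGMM(tf.Module):  # inheriting from tf.Module makes beta trackable
    def __init__(self):
        super(CGMM, self).__init__(name='CGMM')
        self.beta = tf.Variable(1., dtype=np.float32)

    @tf.function
    def objfun(self):
        return tf.reduce_mean(tf.square(self.beta*self.X - self.Y))

    def build_model(self, X, Y):
        self.X, self.Y = X, Y
        optimizer = tf.keras.optimizers.RMSprop(0.5)
        # model=self: the checkpoint now tracks the CGMM module, and therefore beta
        ckpt = tf.train.Checkpoint(step=tf.Variable(1), model=self, optimizer=optimizer)
        manager = tf.train.CheckpointManager(ckpt, './tf_ckpts_cg', max_to_keep=3)
        ckpt.restore(manager.latest_checkpoint)
        if manager.latest_checkpoint:
            print("Restored from {}".format(manager.latest_checkpoint))
        else:
            print("Initializing from scratch.")
        for i in range(20):
            optimizer.minimize(self.objfun, var_list=[self.beta])
            ckpt.step.assign_add(1)
            if int(ckpt.step) % 5 == 0:
                save_path = manager.save()
                print("Saved checkpoint for step {}: {}".format(int(ckpt.step), save_path))
                print("loss {:1.2f}".format(self.objfun().numpy()))
                print("beta {:1.2f}".format(self.beta.numpy()))
        return self.beta

model = CGMM()
opt_beta = model.build_model(X, Y)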
Now if we train for a few steps and look at the content of the checkpoint file we get something promising. The beta variable is now saved:
>>> tf.train.list_variables(tf.train.latest_checkpoint('./tf_ckpts_cg/'))
[('_CHECKPOINTABLE_OBJECT_GRAPH', []),
('model/beta/.ATTRIBUTES/VARIABLE_VALUE', []),
('model/beta/.OPTIMIZER_SLOT/optimizer/rms/.ATTRIBUTES/VARIABLE_VALUE', []),
('optimizer/decay/.ATTRIBUTES/VARIABLE_VALUE', []),
('optimizer/iter/.ATTRIBUTES/VARIABLE_VALUE', []),
('optimizer/learning_rate/.ATTRIBUTES/VARIABLE_VALUE', []),
('optimizer/momentum/.ATTRIBUTES/VARIABLE_VALUE', []),
('optimizer/rho/.ATTRIBUTES/VARIABLE_VALUE', []),
('save_counter/.ATTRIBUTES/VARIABLE_VALUE', []),
('step/.ATTRIBUTES/VARIABLE_VALUE', [])]
And if we run the program a few times, we get:
>>> run tf-ckpt.py
Restored from ./tf_ckpts_cg/ckpt-28
Saved checkpoint for step 145: ./tf_ckpts_cg/ckpt-29
loss 0.00
beta 49.99
Saved checkpoint for step 150: ./tf_ckpts_cg/ckpt-30
loss 0.00
beta 50.00
Hurray!
Note: in order to track variables, you can also use any kind of keras.layers.Layer as well as any keras.Model; that is probably the easiest way (a short sketch follows below).
An extract from the Training Checkpoint guide:
Subclasses of tf.train.Checkpoint, tf.keras.layers.Layer, and tf.keras.Model automatically track variables assigned to their attributes.
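For reference, the same idea with a Keras base class would look roughly like this (illustrative sketch only):

class CGMM(tf.keras.Model):
    def __init__(self):
        super(CGMM, self).__init__(name='CGMM')
        self.beta = tf.Variable(1., dtype=np.float32)  # tracked by the keras.Model base class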
Upvotes: 2