Achilles

Reputation: 1129

Tensorflow "DataLossError: Checksum does not match" when used with supervisor

I've been using TensorFlow with tf.train.Supervisor:

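import tensorflow as tf

# path is the directory that holds the checkpoints and event files
sv = tf.train.Supervisor(logdir=path, save_model_secs=900)  # checkpoint every 15 minutes
with sv.managed_session() as sess:
    if not sv.should_stop():
        # Rest of the code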

Recently it crashed during training, and since then it throws the following error at the with sv.managed_session() line above:

DataLossError (see above for traceback): Checksum does not match: stored 1057608875 vs. calculated on the restored bytes 763056116

[[Node: save/RestoreV2_31 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_31/tensor_names, save/RestoreV2_31/shape_and_slices)]]

Is it possible to fix it?

Upvotes: 4

Views: 8033

Answers (1)

Alexandre Passos

Reputation: 5206

This means your checkpoint file got corrupted. Delete the latest version (i.e. the one with the largest global_step number) and try again; it should then work.
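As a rough illustration of that cleanup, here is a minimal sketch that uses only the Python standard library. It assumes the logdir contains V2-format checkpoint files named like model.ckpt-<global_step>.index, .meta and .data-00000-of-00001 (model.ckpt is the Supervisor's default basename). The helper name quarantine_latest_checkpoint and the corrupt_backup folder are hypothetical, and the files are moved aside rather than deleted so nothing is lost if the guess is wrong.

```python
import os
import re
import shutil

# Assumed naming scheme: <prefix>-<global_step>.<index|meta|data-XXXXX-of-YYYYY>
CKPT_PATTERN = re.compile(
    r"^(?P<prefix>.+)-(?P<step>\d+)\.(index|meta|data-\d+-of-\d+)$"
)


def quarantine_latest_checkpoint(logdir, backup_subdir="corrupt_backup"):
    """Move the files of the highest-step checkpoint into a backup folder.

    Hypothetical helper: returns (step, moved_file_names), or (None, [])
    when no checkpoint files are found in logdir.
    """
    files_by_step = {}
    for name in os.listdir(logdir):
        match = CKPT_PATTERN.match(name)
        if match and os.path.isfile(os.path.join(logdir, name)):
            files_by_step.setdefault(int(match.group("step")), []).append(name)

    if not files_by_step:
        return None, []

    latest_step = max(files_by_step)
    backup_dir = os.path.join(logdir, backup_subdir)
    os.makedirs(backup_dir, exist_ok=True)

    moved = sorted(files_by_step[latest_step])
    for name in moved:
        # Move instead of delete, so the files can be restored if needed.
        shutil.move(os.path.join(logdir, name), os.path.join(backup_dir, name))
    return latest_step, moved
```

Calling quarantine_latest_checkpoint(path) once before restarting training would set the newest checkpoint aside. Note that the logdir also contains a small text file named checkpoint that records the most recent checkpoint path; if it still names the step that was moved, it may need to be edited to point at the previous step before the Supervisor picks that one up.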

Upvotes: 5
