user1700890

Reputation: 7730

tensorflow - cannot restore model - "Couldn't match files for checkpoint"

Here is the code that builds my model and saves it to disk:

import tensorflow as tf
import numpy as np


BATCH_SIZE = 3
VECTOR_SIZE = 1
LEARNING_RATE = 0.1

x = tf.placeholder(tf.float32, [BATCH_SIZE, VECTOR_SIZE],
                   name='input_placeholder')
y = tf.placeholder(tf.float32, [BATCH_SIZE, VECTOR_SIZE],
                   name='labels_placeholder')

W = tf.get_variable('W', [VECTOR_SIZE, BATCH_SIZE])
b = tf.get_variable('b', [VECTOR_SIZE], initializer=tf.constant_initializer(0.0))

y_hat = tf.matmul(W, x) + b
predict = tf.add(tf.matmul(W, x), b, name='predict')
total_loss = tf.reduce_mean(y-y_hat)
train_step = tf.train.AdagradOptimizer(LEARNING_RATE).minimize(total_loss)
X = np.ones([BATCH_SIZE, VECTOR_SIZE])
Y = np.ones([BATCH_SIZE, VECTOR_SIZE])
all_saver = tf.train.Saver()

sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run([train_step], feed_dict={x: X, y: Y})
save_path = r'C:\tmp\tmp\\'
all_saver.save(sess, save_path)

When I try to restore it:

checkpoint_path = r'C:\tmp\tmp\\'
tf.train.latest_checkpoint(checkpoint_path)

I am getting the following error message:

ERROR:tensorflow:Couldn't match files for checkpoint C:\tmp\tmp\\

In C:\tmp\tmp\ I have the following files:

.data-00000-of-00001
.index
.meta
checkpoint

Any thoughts?

Upvotes: 1

Views: 4119

Answers (3)

David-LiCause

Reputation: 11

FWIW I saw this error while training a custom estimator on AI Platform (Cloud ML Engine). The issue for me was caused by the region of the GCS bucket where I was saving the checkpoints/model metadata.

When the region of this bucket was set to us (multiple regions in United States) I saw this error during evaluation. Setting the region of the GCS bucket to the same region where the AI Platform job was running (us-central1 (Iowa) in my case) resolved the issue.

Upvotes: 1

amirbar

Reputation: 839

From the TensorFlow API documentation for Saver.save:

save_path: String. Path to the checkpoint filename. If the saver is sharded, this is the prefix of the sharded checkpoint filename.

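Your save_path does not include a checkpoint filename. It ends at the directory separator, so the checkpoint files were written with an empty name prefix.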

For future saves, set save_path = r'C:\tmp\tmp\my-model' and pass it to all_saver.save.
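To illustrate why the files in the question start with a dot, here is a minimal sketch. It assumes the saver builds each file name by appending a suffix to save_path, which matches the file listing in the question. The helper checkpoint_file_names is hypothetical and not part of TensorFlow; ntpath is used so that Windows path rules apply on any OS.

```python
import ntpath  # Windows path semantics, importable on any platform

# Suffixes taken from the file listing in the question.
SUFFIXES = [".data-00000-of-00001", ".index", ".meta"]


def checkpoint_file_names(save_path):
    """Hypothetical helper: file names produced by appending suffixes to save_path."""
    prefix = ntpath.basename(save_path)  # empty if save_path ends with a separator
    return [prefix + suffix for suffix in SUFFIXES]


bad = checkpoint_file_names(r'C:\tmp\tmp\\')
good = checkpoint_file_names(r'C:\tmp\tmp\my-model')

print(bad)   # ['.data-00000-of-00001', '.index', '.meta']
print(good)  # ['my-model.data-00000-of-00001', 'my-model.index', 'my-model.meta']
```

With a trailing separator the name prefix is empty, so only the suffixes remain as file names.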

If you want to load your previously saved model, do the following:

  1. Prepend the string my-model to the names of these files:
.data-00000-of-00001
.index
.meta
  2. Modify the checkpoint file so that it points to your checkpoint:
model_checkpoint_path: "C:\tmp\tmp\my-model"
all_model_checkpoint_paths: "C:\tmp\tmp\my-model"

Loading the checkpoint should now be possible.
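The two manual steps above could be scripted roughly as follows. This is a sketch using only the standard library, not a TensorFlow utility: repair_checkpoint and its parameters are hypothetical names, and it assumes the checkpoint file is protobuf text format, where backslashes inside a quoted string need to be doubled. Back up the directory before trying it.

```python
import os

# Suffixes of the prefix-less files listed in the question.
SUFFIXES = (".data-00000-of-00001", ".index", ".meta")


def repair_checkpoint(directory, prefix="my-model"):
    """Hypothetical helper: give prefix-less checkpoint files a name prefix."""
    # Step 1: rename ".index" -> "my-model.index", and so on.
    for suffix in SUFFIXES:
        old = os.path.join(directory, suffix)
        if os.path.exists(old):
            os.rename(old, os.path.join(directory, prefix + suffix))

    # Step 2: point the "checkpoint" file at the new prefix.
    # Assumption: backslashes must be escaped inside the quoted path.
    target = os.path.join(directory, prefix).replace("\\", "\\\\")
    with open(os.path.join(directory, "checkpoint"), "w") as f:
        f.write('model_checkpoint_path: "%s"\n' % target)
        f.write('all_model_checkpoint_paths: "%s"\n' % target)
```

Calling repair_checkpoint(r'C:\tmp\tmp') would be the equivalent of the manual edits for the directory in the question.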

Upvotes: 3

simo23

Reputation: 506

Are the files really named like that, starting with a dot?

If so, consider saving them under a different name, because that could be the problem.

Try with:

NUMBER_OF_CKPT = 60
saver.save(sess, save_path, global_step=NUMBER_OF_CKPT)

The usual practice is to pass global_step as well, so that the step number becomes part of the checkpoint name.

Hope this solves it!

Upvotes: 2
