Harshit Kumar

Reputation: 109

How to add ModelCheckpoint as a callback when running a model on TPU?

I am trying to save my model using tf.keras.callbacks.ModelCheckpoint, with filepath pointing to a folder on my drive, but I get this error:

File system scheme '[local]' not implemented (file: './ckpt/tensorflow/training_20220111-093004_temp/part-00000-of-00001')

Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.

Does anybody know what causes this and how to work around it?

Upvotes: 1

Views: 449

Answers (1)

Sascha Kirch

Reputation: 514

It looks like you are trying to access the file system of your host VM from the TPU, which is not directly possible.

When you are using the TPU and want to access files, e.g. in Google Colab, you should place the file-accessing code within:

with tf.device('/job:localhost'):
  <YOUR_CODE>

Now to your problem: the local host acts as the parameter server when training on TPU, so if you want to checkpoint your training, the local host must do it. If you check the documentation for said callback, you will find the options parameter.

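# tf.train.CheckpointOptions is only accepted when save_weights_only=True;
# for full-model saves, pass tf.saved_model.SaveOptions(experimental_io_device='/job:localhost') instead.
checkpoint_options = tf.train.CheckpointOptions(experimental_io_device='/job:localhost')
checkpoint = tf.keras.callbacks.ModelCheckpoint(<YOUR_PATH>, save_weights_only=True, options=checkpoint_options)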
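If it helps, here is a minimal sketch of how the pieces could be wrapped up. The helper names is_local_path and make_checkpoint_callback are made up for illustration and are not part of TensorFlow. It assumes a TF 2.x release in which ModelCheckpoint still accepts the options argument (newer Keras releases may not), and it assumes that paths with a URL scheme such as gs:// are reachable from the TPU directly and therefore need no localhost I/O device.

```python
from urllib.parse import urlparse


def is_local_path(path: str) -> bool:
    """Hypothetical helper: True when the path has no URL scheme (e.g. no gs://)."""
    return urlparse(path).scheme == ""


def make_checkpoint_callback(path: str):
    """Hypothetical helper: build a weights-only ModelCheckpoint callback.

    For local paths, checkpoint I/O is routed through the local host,
    because the TPU workers cannot see the host VM's file system.
    """
    # Imported here so the path helper above works without TensorFlow installed.
    import tensorflow as tf

    options = None
    if is_local_path(path):
        # Assumption: the installed TF version provides experimental_io_device.
        options = tf.train.CheckpointOptions(
            experimental_io_device="/job:localhost"
        )

    # save_weights_only=True is required for CheckpointOptions to be accepted.
    return tf.keras.callbacks.ModelCheckpoint(
        path, save_weights_only=True, options=options
    )
```

You would then pass the result to training, e.g. model.fit(..., callbacks=[make_checkpoint_callback('./ckpt/weights')]).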

Hope this solves your issue!

Best, Sascha

Upvotes: 3
