Reputation: 140
I am training a model with TensorFlow on Google Cloud's AI Platform. The training itself runs fine, but I am unable to save the finished model in SavedModel format to my Cloud Storage bucket. I know the bucket is set up properly, because at the beginning of training I download my training data from that very same bucket. Here is the code I use to save my model:
import os

SAVE_PATH = os.path.join("gs://", 'machine-learning-ebay', 'job-dir')
linear_model.save(SAVE_PATH)
Where 'machine-learning-ebay' is the storage bucket and 'job-dir' is a folder within that storage bucket.
I receive the following error on the job description page in Google Cloud:
Traceback (most recent call last):
[...]
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1219, in save
file_prefix_tensor, object_graph_tensor, options)
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1164, in _save_cached_when_graph_building
save_op = saver.save(file_prefix, options=options)
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 300, in save
return save_fn()
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 287, in save_fn
sharded_prefixes, file_prefix, delete_old_dirs=True)
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 504, in merge_v2_checkpoints
delete_old_dirs=delete_old_dirs, name=name, ctx=_ctx)
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 528, in merge_v2_checkpoints_eager_fallback
attrs=_attrs, ctx=ctx, name=name)
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError: Error executing an HTTP request: HTTP response code 404 with body '{
"error": {
"code": 404,
"message": "No such object: machine-learning-ebay/job-dir/variables/variables_temp/part-00000-of-00001.data-00000-of-00001",
"errors": [
{
"message": "No such object: machine-learning-ebay/job-dir/variables/variables_temp/part-00000-of-00001.data-00000-of-00001",
"domain": "global",
"reason": "notFound"
}
]
}
}
Any help is greatly appreciated; the deadline for this project is today.
Upvotes: 0
Views: 301
Reputation: 140
Following the code in Google's training example (https://github.com/GoogleCloudPlatform/cloudml-samples/blob/main/census/tf-keras/trainer/task.py) and a GitHub issue noting that timestamping the output folders avoids overwrite problems (https://github.com/kubeflow/pipelines/issues/2171), I changed my export code to the following:
now = datetime.now()  # requires: from datetime import datetime
current_time = now.strftime("%H.%M.%S")
tf.compat.v1.keras.experimental.export_saved_model(
    linear_model, 'gs://machine-learning-ebay/job-dir/keras-export' + current_time)
This resolved the training error, and the model exported successfully.
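For reference, here is a minimal sketch of the same timestamped-directory idea using the non-deprecated Keras save API instead of `tf.compat.v1.keras.experimental.export_saved_model`. It assumes TF 2.x and uses a trivial stand-in for `linear_model`; whether it also avoids the original 404 depends on your environment, so treat it as untested:

```python
from datetime import datetime

import tensorflow as tf

# Trivial stand-in model; in the real job, linear_model is the trained model.
linear_model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
linear_model.compile(optimizer="adam", loss="mse")

# Timestamp the export directory so repeated runs never overwrite each other.
export_dir = ("gs://machine-learning-ebay/job-dir/keras-export"
              + datetime.now().strftime("%H.%M.%S"))

# In TF 2.x, save() with a directory path (no .h5 suffix) writes SavedModel format.
linear_model.save(export_dir)

# The exported model can later be reloaded for serving or evaluation.
reloaded = tf.keras.models.load_model(export_dir)
```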
Upvotes: 2