Olav

Reputation: 21

Create large model version in Google ML fails

I've created a TensorFlow session export where the export.meta file is 553.17 MB. Whenever I try to load the exported graph into Google ML, it fails with this error:

gcloud beta ml models versions create --origin=${TRAIN_PATH}/model/ --model=${MODEL_NAME} v1

ERROR: (gcloud.beta.ml.models.versions.create) Error Response: [3] Create Version failed.Error accessing the model location gs://experimentation-1323-ml/face/model/. Please make sure that service account cloud-ml-service@experimentation-1323-10cd8.iam.gserviceaccount.com has read access to the bucket and the objects.

The graph is a static version of a VGG16 face-recognition network, so the export file is empty except for a dummy variable, while all the "weights" are constants in export.meta. Could that affect things? How do I go about debugging this?
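In case it helps, the model location and the bucket ACLs named in the error can be inspected with gsutil; a quick sketch, using the bucket and path from the error above:

gsutil ls -l gs://experimentation-1323-ml/face/model/
gsutil acl get gs://experimentation-1323-ml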

Upvotes: 2

Views: 408

Answers (1)

rhaertel80

Reputation: 8379

Update (11/18/2017)

The service currently expects deployed models to have checkpoint files. Some models, such as Inception, have folded their variables into constants and therefore do not have checkpoint files. We will work on addressing this limitation in the service. In the meantime, as a workaround, you can create a dummy variable, e.g.,

import os

import tensorflow as tf

output_dir = 'my/output/dir'

# A dummy variable so that a checkpoint can be written, even though all of
# the real weights are baked into the graph as constants.
dummy = tf.Variable([0])
saver = tf.train.Saver()

with tf.Session() as sess:
  # Initialize the dummy variable (tf.global_variables_initializer() in later TF 1.x).
  sess.run(tf.initialize_all_variables())
  # Write the checkpoint files under output_dir with the prefix 'export'.
  saver.save(sess, os.path.join(output_dir, 'export'))
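
Once the checkpoint has been written, copy it next to your exported meta graph so that the directory passed as --origin when creating the version (in the question, ${TRAIN_PATH}/model/) contains both export.meta and the checkpoint files produced above.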

Update (11/17/2017)

A previous version of this post noted that the root cause of the problem was that the training service was producing V2 checkpoints but the prediction service was unable to consume them. This has now been fixed, so it is no longer necessary to force training to write V1 checkpoints; by default, V2 checkpoints are written.

Please retry.

Previous Answer

For posterity, the following was the original answer; it may still apply to some users in some cases, so it is left here:

The error indicates that this is a permissions problem, not one related to the size of the model. The getting-started instructions recommend running:

gcloud beta ml init-project

That generally sets up the permissions properly, as long as the bucket containing the model ('experimentation-1323-ml') is in the same project you are using to deploy the model (the normal situation).

If things still aren't working, you'll need to follow these instructions for manually setting the correct permissions.
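For reference, the manual setup essentially amounts to granting the Cloud ML service account read access to the bucket and its objects with gsutil. A rough sketch, reusing the bucket and service-account names from the error message above (the linked instructions are the authoritative reference):

# Let the service account read the bucket itself
gsutil acl ch -u cloud-ml-service@experimentation-1323-10cd8.iam.gserviceaccount.com:R gs://experimentation-1323-ml

# Let it read the existing model objects
gsutil -m acl ch -u cloud-ml-service@experimentation-1323-10cd8.iam.gserviceaccount.com:R gs://experimentation-1323-ml/face/model/*

# Make objects uploaded later readable by default as well
gsutil defacl ch -u cloud-ml-service@experimentation-1323-10cd8.iam.gserviceaccount.com:R gs://experimentation-1323-ml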

Upvotes: 4
