suri

Reputation: 21

Amazon SageMaker training job using a prebuilt Docker image

Hi, I am new to AWS SageMaker. I am trying to deploy a custom time series model on SageMaker, so I built a Docker image from the SageMaker terminal. But when I try to create a training job, I get an error. I have been struggling with this for the past four days; could anyone please help? Here is my code:

lstm = sage.estimator.Estimator(image,
                       role, 1, 'ml.m4.xlarge',
                       output_path='s3://' + s3Bucket,
                       sagemaker_session=sess)

lstm.fit(upload_data)

Here is my error. I attached a policy granting full ECR access to the SageMaker IAM role, and the account is in the same region.

ClientError                               Traceback (most recent call last)
<ipython-input-48-1d7f3ff70f18> in <module>()
      4                        sagemaker_session=sess)
      5 
----> 6 lstm.fit(upload_data)

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name, experiment_config)
    472         self._prepare_for_training(job_name=job_name)
    473 
--> 474         self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
    475         self.jobs.append(self.latest_training_job)
    476         if wait:

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in start_new(cls, estimator, inputs, experiment_config)
   1036             train_args["enable_sagemaker_metrics"] = estimator.enable_sagemaker_metrics
   1037 
-> 1038         estimator.sagemaker_session.train(**train_args)
   1039 
   1040         return cls(estimator.sagemaker_session, estimator._current_job_name)

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/session.pyc in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic, train_use_spot_instances, checkpoint_s3_uri, checkpoint_local_path, experiment_config, debugger_rule_configs, debugger_hook_config, tensorboard_output_config, enable_sagemaker_metrics)
    588         LOGGER.info("Creating training-job with name: %s", job_name)
    589         LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 590         self.sagemaker_client.create_training_job(**train_request)
    591 
    592     def process(

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _api_call(self, *args, **kwargs)
    314                     "%s() only accepts keyword arguments." % py_operation_name)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317 
    318         _api_call.__name__ = str(py_operation_name)

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _make_api_call(self, operation_name, api_params)
    624             error_code = parsed_response.get("Error", {}).get("Code")
    625             error_class = self.exceptions.from_code(error_code)
--> 626             raise error_class(parsed_response, operation_name)
    627         else:
    628             return parsed_response

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Cannot find repository: sagemaker-model in registry ID: 534860077983 Please check if your ECR repository exists and role arn:aws:iam::534860077983:role/service-role/AmazonSageMaker-ExecutionRole-20190508T215284 has proper pull permissions for SageMaker: ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, ecr:GetDownloadUrlForLayer

Upvotes: 0

Views: 672

Answers (1)

jigsawmnc

Reputation: 444

TL;DR: It looks like the ECR image you are passing to the SageMaker estimator does not point at the correct repository. Does the repository named in the error (sagemaker-model, in registry 534860077983) actually exist?
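One way to narrow this down is to check that the image string has the full ECR URI shape and that the repository it names exists. The sketch below is only an illustration: `parse_ecr_image_uri` and `repository_exists` are hypothetical helper names, and the region `us-east-1` in the usage comment is a placeholder because the question does not say which region is in use. It assumes the standard `boto3` ECR client, whose `describe_repositories` call raises `RepositoryNotFoundException` when the repository is missing. Digest-style URIs (`@sha256:...`) are not handled.

```python
import re

# Expected shape of a private ECR image URI:
#   <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>[:<tag>]
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[a-z0-9][a-z0-9._/-]*)(?::(?P<tag>\w[\w.-]*))?$"
)


def parse_ecr_image_uri(image_uri):
    """Split an ECR image URI into account, region, repository and tag.

    Hypothetical helper: raises ValueError if the string is not a full URI,
    for example when only the bare repository name was passed.
    """
    match = ECR_URI_PATTERN.match(image_uri)
    if match is None:
        raise ValueError("not a full ECR image URI: %r" % (image_uri,))
    parts = match.groupdict()
    # Docker falls back to the "latest" tag when none is given.
    parts["tag"] = parts["tag"] or "latest"
    return parts


def repository_exists(ecr_client, repository, registry_id):
    """Return True if the repository exists in the given registry.

    `ecr_client` is assumed to be a boto3 ECR client created for the same
    region in which the training job runs.
    """
    try:
        ecr_client.describe_repositories(
            registryId=registry_id, repositoryNames=[repository]
        )
    except ecr_client.exceptions.RepositoryNotFoundException:
        return False
    return True


# Usage (requires boto3 and AWS credentials; the region is a placeholder):
#   import boto3
#   parts = parse_ecr_image_uri(image)
#   ecr = boto3.client("ecr", region_name=parts["region"])
#   print(repository_exists(ecr, parts["repository"], parts["account"]))
```

If the parse fails, or the lookup returns False for the region of the notebook, the image reference itself is the likely problem rather than the IAM permissions.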

Also make sure that the repository's permissions policy allows the principal sagemaker.amazonaws.com to perform ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, and ecr:GetDownloadUrlForLayer.
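As a rough illustration of such a repository policy, the sketch below builds the JSON document in Python. The function name `build_sagemaker_pull_policy` and the statement ID `AllowSageMakerPull` are made up for this example; the principal and the three actions are the ones listed in the error message.

```python
import json

# The three pull actions named in the SageMaker error message.
SAGEMAKER_PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def build_sagemaker_pull_policy():
    """Return an ECR repository policy (as JSON text) that lets the
    SageMaker service principal pull images from the repository.

    Hypothetical helper; the "Sid" value is an arbitrary label.
    """
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSageMakerPull",
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": SAGEMAKER_PULL_ACTIONS,
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Usage (requires boto3 and AWS credentials; repository name taken from the error):
#   import boto3
#   boto3.client("ecr").set_repository_policy(
#       repositoryName="sagemaker-model",
#       policyText=build_sagemaker_pull_policy(),
#   )
```

The same JSON can also be pasted into the repository's Permissions tab in the ECR console instead of being applied through the API.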

Upvotes: 0
