Reputation: 61
I get the error "Internal Server Error: We encountered an internal error. Please try again." when I run the following script.
The error appears suddenly, after the job has already completed some tasks.
from sagemaker.network import NetworkConfig
from sagemaker.processing import ProcessingInput, ProcessingOutput, Processor
processor = Processor(
    role=****,
    image_uri=****,
    instance_count=1,
    instance_type="m5.large",
    network_config=NetworkConfig(security_group_ids=[****], subnets=[****])
)
processor.run(
    inputs=[
        ProcessingInput(***),
    ],
    outputs=[
        ProcessingOutput(
            source="****",
            destination="****",
            s3_upload_mode="Continuous",
        )
    ]
)
The stack trace is as follows.
  File "run_sagemaker.py", line 44, in process2
    processor.run(
  File "/home/lubuntu/.miniconda/envs/sagemaker/lib/python3.8/site-packages/sagemaker/processing.py", line 165, in run
    self.latest_job.wait(logs=logs)
  File "/home/lubuntu/.miniconda/envs/sagemaker/lib/python3.8/site-packages/sagemaker/processing.py", line 731, in wait
    self.sagemaker_session.logs_for_processing_job(self.job_name, wait=True)
  File "/home/lubuntu/.miniconda/envs/sagemaker/lib/python3.8/site-packages/sagemaker/session.py", line 3167, in logs_for_processing_job
    self._check_job_status(job_name, description, "ProcessingJobStatus")
  File "/home/lubuntu/.miniconda/envs/sagemaker/lib/python3.8/site-packages/sagemaker/session.py", line 2666, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Processing job ****: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.
However, when I set s3_upload_mode="EndOfJob", the error does not occur.
My pc environment is
What am I doing wrong?
Please lend me your wisdom.
Upvotes: 6
Views: 13265
Reputation: 192
I'm not sure whether this is the right answer for you, but it fixed the same issue for me in a nearly identical scenario, and it may help future readers. There seem to be a couple of conditions under which SageMaker fails without a proper error message (reporting only "Internal server error"), one of them being a failure to retrieve the image from ECR: https://github.com/aws/sagemaker-python-sdk/issues/70#issuecomment-637864892 So double-check that your execution role has the correct access and that the image URI is spelled correctly. I changed the permissions of the execution role to:
Effect: Allow
Action:
- "ecr:*" # previously only ecr:BatchGetImage
Resource:
- arn:aws:ecr:<my-region>:<my-acc-no>:repository/<my-repo>
With these permissions, the job runs fine.
Since the "Internal server error" message does not say which privileges were missing, I cannot tell which ones SageMaker really needs on ECR. I have not yet bothered to find out by trial and error, and have stuck with the generous ecr:* allowance (which is also what service roles created from the SageMaker web console use).
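For readers who would rather not keep ecr:*, here is a minimal sketch of a narrower policy, built as a Python dict. The four actions are the ones AWS commonly documents for pulling an image from ECR; whether they are sufficient to avoid the failure described above is an assumption and has not been verified against that scenario. The function name `build_ecr_pull_policy` and the sample region, account number, and repository name are hypothetical.

```python
import json


def build_ecr_pull_policy(region: str, account_id: str, repository: str) -> dict:
    """Return an IAM policy document limited to pulling from one ECR repository.

    Assumption: these pull-related actions are enough for the job's image
    retrieval; this has not been checked against the failing job above.
    """
    repo_arn = f"arn:aws:ecr:{region}:{account_id}:repository/{repository}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # The authorization token is not tied to a repository,
                # so this action cannot be scoped to a repository ARN.
                "Effect": "Allow",
                "Action": ["ecr:GetAuthorizationToken"],
                "Resource": "*",
            },
            {
                # Actions needed to read image manifests and layers.
                "Effect": "Allow",
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": [repo_arn],
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own region, account, and repository.
    policy = build_ecr_pull_policy("us-east-1", "123456789012", "my-repo")
    print(json.dumps(policy, indent=2))
```

If the job still fails with this policy, falling back to ecr:* as described above is a quick way to confirm whether ECR permissions are the cause at all.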
Upvotes: 2
Reputation: 332
Could you go to the AWS SageMaker console -> Processing -> Processing jobs, open the CloudWatch logs for the failed job, and post a more precise error message?
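The same details can usually be pulled without the console. Below is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; it reads the status, failure reason, and exit message from the DescribeProcessingJob API. The helper name `summarize_processing_failure` is hypothetical, and the job name is a placeholder you would replace with your own.

```python
def summarize_processing_failure(sagemaker_client, job_name: str) -> dict:
    """Collect the DescribeProcessingJob fields that help explain a failure."""
    description = sagemaker_client.describe_processing_job(
        ProcessingJobName=job_name
    )
    return {
        "status": description.get("ProcessingJobStatus"),
        # These two keys are only present when the job reported them.
        "failure_reason": description.get("FailureReason"),
        "exit_message": description.get("ExitMessage"),
    }


def main(job_name: str) -> None:
    # Imported here so the helper above can be exercised without AWS access.
    import boto3

    client = boto3.client("sagemaker")
    for key, value in summarize_processing_failure(client, job_name).items():
        print(f"{key}: {value}")


# Usage (placeholder job name):
# main("<your-processing-job-name>")
```

For an InternalServerError the failure reason will likely be as terse as the exception in the question, so the container logs in CloudWatch remain the more useful source.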
Also, it looks like you should be using a SageMaker instance type (these carry an "ml." prefix, for example "ml.m5.large") instead of "m5.large": https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/processing.py#L63
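SageMaker instance types carry an "ml." prefix (for example "ml.m5.large"). As a small guard against passing a plain EC2-style name, a sketch like the following could be used before constructing the Processor. The helper `to_sagemaker_instance_type` is hypothetical and only adds the prefix; it does not check that the resulting type is actually offered for processing jobs in your region.

```python
def to_sagemaker_instance_type(instance_type: str) -> str:
    """Prefix an EC2-style instance type with "ml." unless it already has it."""
    cleaned = instance_type.strip()
    if not cleaned:
        raise ValueError("instance_type must not be empty")
    return cleaned if cleaned.startswith("ml.") else f"ml.{cleaned}"


print(to_sagemaker_instance_type("m5.large"))     # ml.m5.large
print(to_sagemaker_instance_type("ml.m5.large"))  # ml.m5.large
```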
Upvotes: 0