s059ff

Reputation: 61

InternalServerError: We encountered an internal error. Please try again

I get the error "InternalServerError: We encountered an internal error. Please try again." when I run the following script.

The error appears abruptly, after the job has already completed some of its tasks.

from sagemaker.network import NetworkConfig
from sagemaker.processing import ProcessingInput, ProcessingOutput, Processor

processor = Processor(
    role=****,
    image_uri=****,
    instance_count=1,
    instance_type="m5.large",
    network_config=NetworkConfig(security_group_ids=[****], subnets=[****])
)
processor.run(
    inputs=[
        ProcessingInput(***),
    ],
    outputs=[
        ProcessingOutput(
            source="****",
            destination="****",
            s3_upload_mode="Continuous",
        )
    ]
)

The stack trace is as follows.

  File "run_sagemaker.py", line 44, in process2
    processor.run(
  File "/home/lubuntu/.miniconda/envs/sagemaker/lib/python3.8/site-packages/sagemaker/processing.py", line 165, in run
    self.latest_job.wait(logs=logs)
  File "/home/lubuntu/.miniconda/envs/sagemaker/lib/python3.8/site-packages/sagemaker/processing.py", line 731, in wait
    self.sagemaker_session.logs_for_processing_job(self.job_name, wait=True)
  File "/home/lubuntu/.miniconda/envs/sagemaker/lib/python3.8/site-packages/sagemaker/session.py", line 3167, in logs_for_processing_job
    self._check_job_status(job_name, description, "ProcessingJobStatus")
  File "/home/lubuntu/.miniconda/envs/sagemaker/lib/python3.8/site-packages/sagemaker/session.py", line 2666, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Processing job ****: Failed. Reason: InternalServerError: We encountered an internal error.  Please try again.

However, when I set s3_upload_mode="EndOfJob", the error does not occur.

My pc environment is

What am I doing wrong?

Please lend me your wisdom.

Upvotes: 6

Views: 13265

Answers (2)

AdagioMolto

Reputation: 192

I'm not sure whether the following is the right answer for you, but it was for me when I hit the same issue in nearly the same scenario, and it may help future readers. There seem to be a couple of conditions under which SageMaker fails without a proper error message (giving only "Internal server error"). One of them is a failure to retrieve the image from ECR: https://github.com/aws/sagemaker-python-sdk/issues/70#issuecomment-637864892 So double-check that your execution role has the correct access and that the image URI is spelled correctly. I changed the permissions of the execution role to

Effect: Allow
Action:
  - "ecr:*"  # previously only ecr:BatchGetImage
Resource:
  - arn:aws:ecr:<my-region>:<my-acc-no>:repository/<my-repo>

With this change, the job runs fine.

Since the "Internal server error" message does not say which privileges were missing, I cannot tell which ECR permissions SageMaker actually needs, and I have not yet bothered to find out by trial and error. I have since stuck with the generous ecr:* allowance (which is also the setting in the service roles created from the SageMaker web console).
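
For the "is the URI spelled correctly" part, a quick local sanity check can catch typos before the job is submitted. The sketch below is only an illustration: the helper name parse_ecr_image_uri and the regular expression are my own assumptions, and the pattern is a simplified approximation of a private ECR image URI, not an official grammar. It says nothing about whether the role can actually pull the image.

```python
import re

# Simplified, assumed shape of a private ECR image URI:
#   <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repository>[:tag][@sha256:digest]
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?"
    r"/(?P<repository>[a-z0-9]+(?:[._/-][a-z0-9]+)*)"
    r"(?::(?P<tag>[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}))?"
    r"(?:@(?P<digest>sha256:[a-f0-9]{64}))?$"
)


def parse_ecr_image_uri(image_uri):
    """Return the URI's parts as a dict, or None if it does not look like an ECR URI."""
    match = ECR_URI_PATTERN.match(image_uri)
    return match.groupdict() if match else None
```

A None result only means the string does not match this simplified pattern, so treat it as a hint to re-read the URI rather than as proof that it is wrong.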

Upvotes: 2

Deng Chuyang

Reputation: 332

Can you go to the AWS SageMaker console -> Processing -> Processing Jobs, get the CloudWatch logs, and post a more precise error message?
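
If the console is inconvenient, the same job details can probably be pulled programmatically. This is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The helper names are hypothetical; describe_processing_job is the boto3 SageMaker client call, and ProcessingJobStatus, FailureReason and ExitMessage are fields of its response. The failure reason may still be the same generic message.

```python
def summarize_processing_job(description):
    """Pick the fields most useful for debugging out of a DescribeProcessingJob response."""
    return {
        "status": description.get("ProcessingJobStatus"),
        "failure_reason": description.get("FailureReason"),
        "exit_message": description.get("ExitMessage"),
    }


def fetch_processing_job_summary(job_name, region_name=None):
    """Call the SageMaker API for one processing job and summarize the response."""
    import boto3  # imported lazily so summarize_processing_job works without AWS access

    client = boto3.client("sagemaker", region_name=region_name)
    description = client.describe_processing_job(ProcessingJobName=job_name)
    return summarize_processing_job(description)
```

Calling fetch_processing_job_summary with the name of the failed job returns the status together with whatever failure reason and exit message the service recorded.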

Also, it looks like you should be using a SageMaker instance type (for example "ml.m5.large") instead of "m5.large": https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/processing.py#L63
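
As a rough illustration of the instance type point, the check below only looks for the "ml." prefix that SageMaker instance type names carry. Both helper names are hypothetical, and the sketch deliberately ignores special values such as the SDK's local mode.

```python
def is_sagemaker_instance_type(instance_type):
    """Return True if the name carries the "ml." prefix used by SageMaker instance types."""
    return instance_type.startswith("ml.")


def to_sagemaker_instance_type(instance_type):
    """Add the "ml." prefix when it is missing; leave already-prefixed names untouched."""
    if is_sagemaker_instance_type(instance_type):
        return instance_type
    return f"ml.{instance_type}"
```

With this, the value from the question, "m5.large", would be flagged and rewritten as "ml.m5.large".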

Upvotes: 0
