iCHAIT

Reputation: 594

How to install Python packages within an Amazon SageMaker Processing Job?

I am trying to create an SKLearn processing job in Amazon SageMaker to perform some data transformation on my input data before model training.

I wrote a custom Python script, preprocessing.py, that performs the required transformations. The script imports a few third-party Python packages. Here is the SageMaker example I followed.
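Roughly, my submission looks like the sketch below (the role and S3 paths are placeholders):

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(
    framework_version="0.20.0",
    role=role,  # an existing SageMaker execution role
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

sklearn_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(
        source="s3://my-bucket/raw",  # placeholder input location
        destination="/opt/ml/processing/input",
    )],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
)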

When I submit the Processing Job, I get this error:

............................Traceback (most recent call last):
  File "/opt/ml/processing/input/code/preprocessing.py", line 6, in <module>
    import snowflake.connector
ModuleNotFoundError: No module named 'snowflake.connector'

I understand that the processing job cannot find this package and that I need to install it. My question is: how can I accomplish this through the SageMaker Processing Job API? Ideally there would be a way to specify a requirements.txt in the API call, but I don't see such functionality in the docs.

I know I could build a custom image with the relevant packages and use that image in the Processing Job, but that seems like a lot of work for something that should be built in.

Is there an easier, more elegant way to install the packages a SageMaker Processing Job needs?

Upvotes: 11

Views: 12294

Answers (2)

Lukas Hestermeyer

Reputation: 1033

Another option is to use a bash script instead of a Python file as the entrypoint.

I defined the entrypoint in my Step Functions state machine definition (Amazon States Language in YAML, which I deploy via CloudFormation) like this:

...
MyProcessingJob:
  Type: Task
  Resource: arn:aws:states:::sagemaker:createProcessingJob.sync
  Parameters:
    AppSpecification:
      ContainerEntrypoint: ["bash", "/opt/ml/processing/input/code/start_process.sh"]
      ImageUri: "492215442770.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3"
...

The bash script could look something like this:

#!/bin/bash
set -e  # stop on the first failing command

# cd into the code folder
cd /opt/ml/processing/input/code

# install requirements
pip install -r requirements.txt

# start preprocessing
/miniconda3/bin/python -m entryscript --parameter value

This is what I use with the standard SKLearn Docker image, and it works great.
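For completeness: start_process.sh and requirements.txt reach /opt/ml/processing/input/code through a ProcessingInput in the same state definition, roughly like this (the bucket is a placeholder):

ProcessingInputs:
  - InputName: code
    S3Input:
      S3Uri: s3://my-bucket/processing-code/  # placeholder; holds start_process.sh, requirements.txt, entryscript
      LocalPath: /opt/ml/processing/input/code
      S3DataType: S3Prefix
      S3InputMode: File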

Upvotes: 1

Neil McGuigan

Reputation: 48256

One way would be to call pip from Python at the top of your processing script, before the imports that need the packages:

import subprocess
import sys

# install the PyPI package that provides snowflake.connector
subprocess.check_call([sys.executable, "-m", "pip", "install", "snowflake-connector-python"])

Another way would be to use an SKLearn Estimator (a training job) instead to do the same thing. You can provide a source_dir, which can include a requirements.txt file, and those requirements will be installed for you:

from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="foo.py",
    source_dir="./foo",  # no trailing slash! put requirements.txt here
    framework_version="0.23-1",
    role=...,  # an existing SageMaker execution role
    instance_count=1,
    instance_type="ml.m5.large",
)
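Launching the job then uploads source_dir and installs the requirements inside the container before foo.py runs (a minimal sketch; pass inputs={"train": "s3://..."} if foo.py reads data channels):

estimator.fit()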

Upvotes: 8
