Reputation: 594
I am trying to create an SKLearn processing job in Amazon SageMaker to transform my input data before model training.
I wrote a custom Python script, preprocessing.py, which does the transformation and uses some Python packages. Here is the SageMaker example I followed.
When I submit the processing job, I get this error:
............................Traceback (most recent call last):
  File "/opt/ml/processing/input/code/preprocessing.py", line 6, in <module>
    import snowflake.connector
ModuleNotFoundError: No module named 'snowflake.connector'
I understand that the processing job cannot find this package and that I need to install it. My question is: how can I do this through the SageMaker Processing Job API? Ideally there would be a way to specify a requirements.txt in the API call, but I don't see such functionality in the docs.
I know I can build a custom image with the required packages and use that image in the processing job, but that seems like too much work for something that should be built in.
Is there an easier, more elegant way to install the packages needed in a SageMaker processing job?
Upvotes: 11
Views: 12294
Reputation: 1033
Another option is to use a bash script instead of a Python file as the entrypoint.
I defined the entrypoint in CloudFormation like this:
...
MyProcessingJob:
  Type: Task
  Resource: arn:aws:states:::sagemaker:createProcessingJob.sync
  Parameters:
    AppSpecification:
      ContainerEntrypoint: ["bash", "/opt/ml/processing/input/code/start_process.sh"]
      ImageUri: "492215442770.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3"
...
The bash script could look something like:
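# stop at the first failing command, so a failed install does not run the job anyway
set -e
# cd into the folder that holds the code and requirements.txt
cd /opt/ml/processing/input/code
# install requirements
pip install -r requirements.txt
# start preprocessing
/miniconda3/bin/python -m entryscript --parameter value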
This is what I used on the standard SKLearn Docker Image and it works great.
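If the entrypoint has to stay a Python file, the same two steps (install from requirements.txt, then run the real work) can be sketched in Python. This is only a sketch: the helper names requirements_command and install_requirements are made up for illustration, and it assumes a requirements.txt has been made available in the code directory alongside the script, which depends on how the job's inputs are configured.

```python
import subprocess
import sys
from pathlib import Path


def requirements_command(code_dir):
    """Return the pip argv for code_dir/requirements.txt, or None if the file is absent."""
    req = Path(code_dir) / "requirements.txt"
    if not req.is_file():
        return None
    # Use the running interpreter's pip so packages land in the same environment.
    return [sys.executable, "-m", "pip", "install", "-r", str(req)]


def install_requirements(code_dir="/opt/ml/processing/input/code"):
    """Install requirements, if any, before the processing imports run."""
    cmd = requirements_command(code_dir)
    if cmd is not None:
        # check_call raises CalledProcessError if pip exits with a non-zero status.
        subprocess.check_call(cmd)
    return cmd
```

Calling install_requirements() at the very top of the script, before any import of the third-party packages, would mirror what the bash script does.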
Upvotes: 1
Reputation: 48256
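One way would be to call pip from Python before the failing import (this needs import subprocess and import sys, with package set to the name of the package to install):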
subprocess.check_call([sys.executable, "-m", "pip", "install", package])
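A slightly fuller sketch of that idea, under a few assumptions: the helper names pip_install_command and ensure_installed are hypothetical, and the mapping from import name to pip distribution name has to be supplied by hand because the two can differ (snowflake.connector is distributed as snowflake-connector-python).

```python
import importlib.util
import subprocess
import sys


def pip_install_command(packages):
    """Build the argv that installs `packages` with the current interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", *packages]


def ensure_installed(requirements):
    """Install distributions whose top-level module cannot be found.

    `requirements` maps a top-level module name to its pip distribution name.
    Returns the list of distributions that were installed.
    """
    missing = [
        dist
        for module, dist in requirements.items()
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(module) is None
    ]
    if missing:
        subprocess.check_call(pip_install_command(missing))
    return missing
```

In preprocessing.py this could be called as ensure_installed({"snowflake": "snowflake-connector-python"}) above the import snowflake.connector line. The check only looks at the top-level name, so if another installed package already provides a snowflake namespace the install would be skipped.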
Another way would be to use an SKLearn Estimator (training job) instead, to do the same thing. You can provide the source_dir, which can include a requirements.txt file, and these requirements will be installed for you:
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="foo.py",
    source_dir="./foo",  # no trailing slash! put requirements.txt here
    framework_version="0.23-1",
    role=...,
    instance_count=1,
    instance_type="ml.m5.large",
)
Upvotes: 8