Reputation: 61
I'm trying to run a processing job for machine learning using the new Sagemaker Spark container. The cluster launches, but I immediately run into an ImportError - my dependencies are missing.
I get that the Spark container doesn't have those dependencies, and I've tried to follow steps outlined on SO to install them - namely, using the submit_py_files parameter in PySparkProcessor.run() to submit a .zip file of all my dependencies. However, it doesn't seem to be installing them.
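For reference, here is roughly how I'm submitting the job (the role, framework version, and instance settings below are placeholders):

from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="2.4",  # placeholder - whichever version the Spark container supports
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

spark_processor.run(
    submit_app="spark_preprocess.py",
    submit_py_files=["dependencies.zip"],  # zip of pandas, pyarrow, and my other dependencies
)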
Is there a way to use the Sagemaker PySparkProcessor class to execute a bootstrap script when a cluster launches? I'm currently trying to run a processing workload that uses pandas_udfs, and I'm seeing an ImportError when the cluster tries to use PyArrow:
Traceback (most recent call last):
File "/opt/ml/processing/input/code/spark_preprocess.py", line 35, in <module>
@pandas_udf("float", PandasUDFType.GROUPED_AGG)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 47, in _create_udf
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 149, in require_minimum_pyarrow_version
ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.
The dependencies.zip contains PyArrow 0.16.0, and I'm using the latest version of the Sagemaker Python SDK.
I know with EMR you can submit a bootstrap action script to install dependencies - is there a similar option here? Thanks!
Upvotes: 6
Views: 1293
Reputation: 41
Instead of using PySparkProcessor directly, use SageMaker script mode via the ScriptProcessor class. It lets you run your own script as the entry point, and you can specify your dependencies and configuration through a requirements.txt file.
For example:

from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(
    base_job_name="your-job-name",
    image_uri="sagemaker-spark-your-region",  # Use the appropriate Spark container URI for your region
    command=["python3"],
    role="your-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=3600,
)

script_processor.run(
    code="your-entry-point-script.py",
    arguments=["arg1", "arg2"],
    # Add any other parameters as needed
)
In the same directory as your entry point script, create a requirements.txt file listing your dependencies. This will include PyArrow and any other required packages.
pandas==your_pandas_version
pyarrow==0.16.0
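I'm not certain the Spark processing container installs requirements.txt on its own. If it doesn't, a simple workaround (just a sketch, assuming the container has outbound access to PyPI) is to install the same packages at the very top of your entry point script, before anything imports them:

import subprocess
import sys

# Assumed workaround: install the dependencies at runtime, before
# pandas_udf / PyArrow are imported anywhere else in the script.
subprocess.check_call([sys.executable, "-m", "pip", "install",
                       "pyarrow==0.16.0", "pandas"])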
Upvotes: 0