Reputation: 61
I'm trying to run a processing job for machine learning using the new Sagemaker Spark container. The cluster launches, but I immediately run into an ImportError - my dependencies are missing.
I get that the Spark container doesn't have those dependencies, and I've tried to follow steps outlined on SO to install them - namely, using the submit_py_files parameter in PySparkProcessor.run() to submit a .zip file of all my dependencies. However, it doesn't seem to be installing them.
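For reference, here is roughly how I'm submitting the job (the role, framework version, and instance settings below are placeholders):

from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="2.4",  # placeholder - whichever version the Spark container supports
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

spark_processor.run(
    submit_app="spark_preprocess.py",
    submit_py_files=["dependencies.zip"],  # zip of pandas, pyarrow, and my other dependencies
)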
Is there a way to use the Sagemaker PySparkProcessor class to execute a bootstrap script when a cluster launches? I'm currently trying to run a processing workload that uses pandas_udfs, and I'm seeing an ImportError when the cluster tries to use PyArrow:
Traceback (most recent call last):
File "/opt/ml/processing/input/code/spark_preprocess.py", line 35, in <module>
@pandas_udf("float", PandasUDFType.GROUPED_AGG)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 47, in _create_udf
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 149, in require_minimum_pyarrow_version
ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.
The dependencies.zip contains PyArrow 0.16.0, and I'm using the latest version of the Sagemaker Python SDK.
I know with EMR you can submit a bootstrap action script to install dependencies - is there a similar option here? Thanks!
Upvotes: 6
Views: 1293
Reputation: 41
Instead of using PySparkProcessor directly, use SageMaker script mode via the ScriptProcessor class. It lets you run your own script as the entry point, and you can specify your dependencies and configuration through a requirements.txt file.
For example:

from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(
    base_job_name="your-job-name",
    image_uri="sagemaker-spark-your-region",  # Use the appropriate Spark container URI for your region
    command=["python3"],
    role="your-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=3600,
)

script_processor.run(
    code="your-entry-point-script.py",
    arguments=["arg1", "arg2"],
    # Add any other parameters as needed
)
In the same directory as your entry point script, create a requirements.txt file listing your dependencies. This will include PyArrow and any other required packages.
pandas==your_pandas_version
pyarrow==0.16.0
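I'm not certain the Spark processing container installs requirements.txt on its own. If it doesn't, a simple workaround (just a sketch, assuming the container has outbound access to PyPI) is to install the same packages at the very top of your entry point script, before anything imports them:

import subprocess
import sys

# Assumed workaround: install the dependencies at runtime, before
# pandas_udf / PyArrow are imported anywhere else in the script.
subprocess.check_call([sys.executable, "-m", "pip", "install",
                       "pyarrow==0.16.0", "pandas"])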
Upvotes: 0