micmia

Reputation: 1401

How to install external Python dependencies for Spark workers in a PySpark cluster?

I am running a Spark job with a PySpark user-defined function (UDF) that requires the requests package, which is not part of the Python standard library. When executing the job, I get the error ModuleNotFoundError: No module named 'requests'. I understand that the dependency is missing on the workers, which causes the UDF to fail.

I tried following the instructions in the official documentation (https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html) using pyspark.SparkContext.addPyFile(), but I couldn't get it to work.

Could someone provide a simple guide or example on how to install external Python dependencies for Spark workers in a PySpark cluster? Any help would be appreciated.

Upvotes: 1

Views: 1041

Answers (1)

micmia

Reputation: 1401

Finally, I solved my problem. For anyone else facing a similar issue, the following steps should be a viable solution.

  1. Execute the following command from the root directory of your project:
pip install --platform manylinux2014_x86_64 --target=dependencies --implementation cp --python-version 3.11 --only-binary=:all: --upgrade requests

If you are using Poetry, the command should be:

poetry run pip install --platform manylinux2014_x86_64 --target=dependencies --implementation cp --python-version 3.11 --only-binary=:all: --upgrade requests

This command installs the requests package into a folder named dependencies, built for the manylinux2014_x86_64 + Python 3.11 environment of the executors (workers). Adjust the --platform and --python-version flags if your executors run a different architecture or Python version.
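If you are not sure which values to pass to --platform and --python-version, one way to check is to run a tiny task on the cluster and print what the executors report. This is only a sketch: it assumes an active spark session (as in the code further below), and runtime_info is just an illustrative helper name.

def runtime_info(_):
    # Runs on an executor: report the Python version and CPU architecture
    # that the wheels inside dependencies.zip must match.
    import platform
    import sys
    yield f"python={sys.version_info.major}.{sys.version_info.minor}, machine={platform.machine()}"

# A single one-element partition is enough to get one report back.
print(spark.sparkContext.parallelize([0], numSlices=1).mapPartitions(runtime_info).collect())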

  2. Zip all the files in the dependencies folder into a file named dependencies.zip, which serves as the "zipped Python packages" archive mentioned in the official documentation.
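If you do not already have a preferred zip tool, here is a minimal sketch using Python's standard library, run from the project root. Any other tool works just as well, as long as the packages end up at the top level of the archive rather than inside a dependencies/ subfolder.

import shutil

# Archive the *contents* of the dependencies folder (not the folder itself),
# producing dependencies.zip in the project root.
shutil.make_archive('dependencies', 'zip', root_dir='dependencies')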

  3. Make the following adjustments to your code:

from pyspark.sql import functions as sf

# Ship the zipped dependencies to every executor and put the archive on
# their PYTHONPATH.
spark.sparkContext.addPyFile('dependencies.zip')

def func(a):
    # Import inside the UDF so the module is resolved on the workers,
    # after dependencies.zip has been distributed to them.
    import requests
    # your code here

spark.udf.register("func", sf.udf(func))
df = spark.sql('SELECT func(...) FROM ...')

By following these steps, the requests package should be correctly distributed to the workers in your Spark cluster, allowing your UDF to execute without any issues related to missing dependencies.

Upvotes: 0
