Reputation: 1401
I am running a Spark job with a PySpark user-defined function (UDF) that requires the requests package, which is not part of the Python standard library. When executing the job, I encounter the error ModuleNotFoundError: No module named 'requests'. I understand that the dependency is missing on the workers, causing the UDF to fail.
I tried following the instructions in the official documentation (https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html) using pyspark.SparkContext.addPyFile(), but I couldn't get it to work.
Could someone provide a simple guide or example on how to install external Python dependencies for Spark workers in a PySpark cluster? Any help would be appreciated.
Upvotes: 1
Views: 1041
Reputation: 1401
I finally solved my problem. For anyone else facing a similar issue, the following steps worked for me. First, install the package into a local folder with pip, targeting the executors' platform and Python version:
pip install --platform manylinux2014_x86_64 --target=dependencies --implementation cp --python-version 3.11 --only-binary=:all: --upgrade requests
If you are using Poetry, the command should be:
poetry run pip install --platform manylinux2014_x86_64 --target=dependencies --implementation cp --python-version 3.11 --only-binary=:all: --upgrade requests
This command installs the requests package into a folder named dependencies, targeting the manylinux2014_x86_64 + Python 3.11 executor (worker) environment. Adjust the --platform and --python-version flags to match whatever your cluster's workers actually run.
Next, zip all the files in the dependencies folder into a file named dependencies.zip; this archive serves as the zipped Python packages mentioned in the official documentation.
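On Linux or macOS, a minimal sketch of the zipping step looks like this. The important detail is that the package directories (e.g. requests/) must sit at the root of the archive, not nested inside a dependencies/ folder, because addPyFile puts the zip itself on sys.path on the workers:
cd dependencies
zip -r ../dependencies.zip .
cd ..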
Make the following adjustments to your code:
from pyspark.sql import functions as sf

# Ship the zipped dependencies to the executors; the archive is added
# to sys.path on each worker.
spark.sparkContext.addPyFile('dependencies.zip')

def func(a):
    # Import inside the UDF so the module is resolved on the worker.
    import requests
    # your code here

spark.udf.register("func", sf.udf(func))
df = spark.sql('SELECT func(...) FROM ...')
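As a concrete (hypothetical) illustration, here is an end-to-end sketch in which the UDF fetches the HTTP status code for a URL column; the table name urls and the column url are invented for this example:
from pyspark.sql import functions as sf
from pyspark.sql.types import IntegerType

spark.sparkContext.addPyFile('dependencies.zip')

def http_status(url):
    # requests is resolved from dependencies.zip on the worker
    import requests
    return requests.get(url, timeout=5).status_code

spark.udf.register("http_status", sf.udf(http_status, IntegerType()))
df = spark.sql('SELECT url, http_status(url) AS status FROM urls')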
By following these steps, the requests package should be correctly distributed to the workers in your Spark cluster, allowing your UDF to run without missing-dependency errors.
Upvotes: 0