Reputation: 5513
I have written a Spark script that depends on six and various other Python packages.
$ cat ./test_package/__init__.py
from six.moves.urllib.request import urlopen

def download_size(url):
    return len(urlopen(url).read())
As such, I have written a setup.py that declares these dependencies.
$ cat ./setup.py
from setuptools import setup

setup(
    name="Test App",
    packages=['test_package'],
    version="0.1",
    install_requires=['six>=1.0'],
)
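(The egg under dist/ that the Spark script ships is the standard setuptools build artifact; the build command isn't shown here, but it would presumably be something like

$ python setup.py bdist_egg

which, under Python 2.7, produces dist/Test_App-0.1-py2.7.egg.)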
Then, in my Spark script, I have code that requires the package:
$ cat spark_script.py
#!/usr/lib/spark/bin/spark-submit
from pyspark import SparkContext
from glob import glob
from test_package import download_size
sc = SparkContext()
sc.addPyFile(glob('dist/Test_App-*.egg')[0])
...
sc.parallelize(urls).map(download_size).collect()
If I run
$ ./spark_script.py
it works fine. However, if I try to use Python 3,
$ PYSPARK_PYTHON=python3 ./spark_script.py
the master node is able to import test_package, but in the middle of the MapReduce job I get this on each worker node:
File "/hadoop/yarn/nm-local-dir/usercache/sam/appcache/application_1487279780844_0041/container_1487279780844_0041_01_000003/pyspark.zip/pyspark/serializers.py", line 419, in loads
return pickle.loads(obj, encoding=encoding)
File "./Test_App-0.1-py2.7.egg/test_package/__init__.py", line 2, in <module>
from six.moves.urllib.request import urlopen
ImportError: No module named 'six'
How do I manage Python dependencies on a Google Cloud Dataproc-provisioned Apache Spark cluster?
Upvotes: 1
Views: 3334
Reputation: 10677
Since worker tasks run on the worker nodes, and you only installed your extra Python packages manually on the master, the worker nodes don't have the same configuration available as your master node.
You should use Dataproc initialization actions to run your customization scripts on all nodes of the cluster at cluster-deployment time. For environment variables like PYSPARK_PYTHON, you probably need to append those settings to /etc/spark/conf/spark-env.sh.
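For example, an initialization action for this case could be a small shell script along the lines of the sketch below (not an official init-action script; the file name, bucket, and cluster name are placeholders, and it assumes a Debian-based Dataproc image where apt-get and pip3 are available):

$ cat install-python3-deps.sh
#!/bin/bash
# Runs on every node (master and workers) when the cluster is created.
set -e
apt-get -y update
apt-get -y install python3-pip
# Install the same dependency your egg declares in install_requires.
pip3 install 'six>=1.0'
# Point PySpark at python3 on every node instead of exporting it by hand.
echo 'export PYSPARK_PYTHON=python3' >> /etc/spark/conf/spark-env.sh

Upload it to Cloud Storage and reference it at cluster-creation time:

$ gsutil cp install-python3-deps.sh gs://my-bucket/
$ gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://my-bucket/install-python3-deps.sh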
Upvotes: 1