Reputation: 355
According to the Spark docs, there is a way to pass environment variables to spawned executors:
spark.executorEnv.[EnvironmentVariableName] Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.
I'm trying to point my PySpark app at a specific Python executable (an Anaconda environment with numpy etc.), which is usually done by setting the PYSPARK_PYTHON variable in spark-env.sh. Although that works, shipping a new config to every cluster node each time I want to switch virtualenvs seems like massive overkill.
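For reference, that spark-env.sh approach is just an exported variable, roughly like this (a sketch; the Anaconda path is the one from my setup):
# conf/spark-env.sh, on every node in the cluster
export PYSPARK_PYTHON="/usr/share/anaconda/bin/python"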
That's why I tried to pass PYSPARK_PYTHON in the following way:
uu@e1:~$ PYSPARK_DRIVER_PYTHON=ipython pyspark --conf \
spark.executorEnv.PYSPARK_PYTHON="/usr/share/anaconda/bin/python" \
--master spark://e1.local:7077
But it doesn't seem to work:
In [1]: sc._conf.getAll()
Out[1]:
[(u'spark.executorEnv.PYSPARK_PYTHON', u'/usr/share/anaconda/bin/python'),
(u'spark.rdd.compress', u'True'),
(u'spark.serializer.objectStreamReset', u'100'),
(u'spark.master', u'spark://e1.local:7077'),
(u'spark.submit.deployMode', u'client'),
(u'spark.app.name', u'PySparkShell')]
In [2]: def dummy(x):
   ...:     import sys
   ...:     return sys.executable
   ...:
In [3]: sc.parallelize(xrange(100),50).map(dummy).take(10)
Out[3]:
['/usr/bin/python2.7',
'/usr/bin/python2.7',
'/usr/bin/python2.7',
'/usr/bin/python2.7',
'/usr/bin/python2.7',
'/usr/bin/python2.7',
'/usr/bin/python2.7',
'/usr/bin/python2.7',
'/usr/bin/python2.7',
'/usr/bin/python2.7']
My spark-env.sh does not have PYSPARK_PYTHON configured, so the executors fall back to the default Python. Some additional info: it's a Spark 1.6.0 standalone-mode cluster.
Am I missing something important here?
Upvotes: 0
Views: 7839
Reputation: 2155
Taking a quick peek at https://github.com/apache/spark/blob/master/bin/pyspark
I think they are just reading an exported variable. Can you do
export PYSPARK_PYTHON="/usr/share/anaconda/bin/python"
to see if it applies to all executors, and then just run
PYSPARK_DRIVER_PYTHON=ipython pyspark
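Put together (a sketch reusing the master URL from the question), that would look like:
export PYSPARK_PYTHON="/usr/share/anaconda/bin/python"
PYSPARK_DRIVER_PYTHON=ipython pyspark --master spark://e1.local:7077
If that works, re-running sc.parallelize(xrange(100), 50).map(dummy).take(10) from the question should report the Anaconda interpreter on the executors instead of /usr/bin/python2.7.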
Upvotes: 2