alreadyexists

Reputation: 355

spark.executorEnv doesn't seem to have any effect

According to the Spark docs, there is a way to pass environment variables to spawned executors:

spark.executorEnv.[EnvironmentVariableName] Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.

I'm trying to direct my pyspark app to use a specific python executable (an anaconda environment with numpy etc.), which is usually done by setting the PYSPARK_PYTHON variable in spark-env.sh. Although that way works, shipping new config to all the cluster nodes every time I want to switch virtualenvs seems like huge overkill.
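
For reference, the cluster-wide approach I'm trying to avoid would look roughly like this (same anaconda path as above), shipped to conf/spark-env.sh on every node:

# conf/spark-env.sh on each node -- the setting I'd rather not have to redistribute on every switch
export PYSPARK_PYTHON=/usr/share/anaconda/bin/python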

That's why I tried to pass PYSPARK_PYTHON in the following way:

uu@e1:~$ PYSPARK_DRIVER_PYTHON=ipython pyspark --conf \
spark.executorEnv.PYSPARK_PYTHON="/usr/share/anaconda/bin/python" \
--master spark://e1.local:7077

But it doesn't seem to work:

In [1]: sc._conf.getAll()
Out[1]: 
[(u'spark.executorEnv.PYSPARK_PYTHON', u'/usr/share/anaconda/bin/python'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.master', u'spark://e1.local:7077'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.app.name', u'PySparkShell')]

In [2]: def dummy(x):
   ...:     import sys
   ...:     return sys.executable
   ...:

In [3]: sc.parallelize(xrange(100),50).map(dummy).take(10)

Out[3]: 
['/usr/bin/python2.7',
 '/usr/bin/python2.7',
 '/usr/bin/python2.7',
 '/usr/bin/python2.7',
 '/usr/bin/python2.7',
 '/usr/bin/python2.7',
 '/usr/bin/python2.7',
 '/usr/bin/python2.7',
 '/usr/bin/python2.7',
 '/usr/bin/python2.7']

My spark-env.sh does not have PYSPARK_PYTHON configured, so this is just the default python being called. Some additional info: it's a Spark 1.6.0 cluster running in standalone mode.

Am I missing something important here?

Upvotes: 0

Views: 7839

Answers (1)

charles gomes

Reputation: 2155

Taking a quick peek at https://github.com/apache/spark/blob/master/bin/pyspark

I think they are just doing an export there. Can you do

export PYSPARK_PYTHON="/usr/share/anaconda/bin/python"

to see if that gets applied to all the executors, and then just run

PYSPARK_DRIVER_PYTHON=ipython pyspark
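
Putting the export and the launch together with the master URL from your question, something along these lines (just a sketch, reusing your paths):

# set PYSPARK_PYTHON in the launching shell, as suggested above,
# then start the shell against the standalone master
export PYSPARK_PYTHON="/usr/share/anaconda/bin/python"
PYSPARK_DRIVER_PYTHON=ipython pyspark --master spark://e1.local:7077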

Upvotes: 2
