Reputation: 1
I need to use a virtual environment in a PySpark EMR cluster.
I am launching the application with spark-submit using the following configuration.
spark-submit --deploy-mode client --archives path_to/environment.tar.gz#environment --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python
Environment variables are set in the Python script. The code works when packages are imported inside the function that runs on the executors, but I want to import them outside the function. What's wrong?
import os

from pyspark import SparkConf
from pyspark import SparkContext

import pendulum  # ImportError!

os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "which python"

conf = SparkConf()
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)

def some_function(x):
    import pendulum  # import works here
    dur = pendulum.duration(days=x)
    # More properties
    # Use the libraries to do work
    return dur.weeks

rdd = (sc.parallelize(range(1000))
       .map(some_function)
       .take(10))
print(rdd)

import pendulum  # ImportError
Upvotes: 0
Views: 365
Reputation: 525
From the looks of it, you are using the correct environment for the workers (PYSPARK_PYTHON) but overriding the environment for the driver (PYSPARK_DRIVER_PYTHON), so the driver's Python does not see the package you want to import. The code in some_function is executed by the workers (never by the driver), so it can see the imported package, while the last import is executed by the driver and fails.
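One way to address this (a minimal sketch, untested on your cluster; driver_env and your_script.py are placeholder names, and it assumes environment.tar.gz was built with something like venv-pack and contains pendulum) is to give the driver a Python that actually has the package installed and select it at submit time rather than inside the script, since in client mode the driver's interpreter is already chosen before your code runs. For example, you could unpack the same packed environment on the node you submit from and point PYSPARK_DRIVER_PYTHON at it:

# sketch: unpack the archive locally so the driver (client mode) can use it too
mkdir -p driver_env && tar -xzf path_to/environment.tar.gz -C driver_env

# keep the executor configuration from the question, add a driver Python that has pendulum
PYSPARK_DRIVER_PYTHON=./driver_env/bin/python \
spark-submit --deploy-mode client \
  --archives path_to/environment.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  your_script.py

With that, both the top-level import pendulum in the driver and the import inside some_function on the executors should resolve; you can also drop the "which python" override from the script, since it does not point at a real interpreter.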
Upvotes: 0