robamemar

Reputation: 1

Importing a package in client mode PySpark

I need to use a virtual environment in a PySpark EMR cluster.

I am launching the application with spark-submit using the following configuration.

spark-submit --deploy-mode client --archives path_to/environment.tar.gz#environment --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python

Environment variables are set in the Python script. The code works when I import the package inside the function that runs on the Spark executors, but I want to import it at the top level of the script. What's wrong?

import os

from pyspark import SparkConf
from pyspark import SparkContext
import pendulum  # ImportError!

os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "which python"

conf = SparkConf()
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)


def some_function(x):

    import pendulum  # imports correctly here
    dur = pendulum.duration(days=x)

    # More properties

    # Use the libraries to do work
    return dur.weeks


rdd = (sc.parallelize(range(1000))
       .map(some_function)
       .take(10))

print(rdd)
import pendulum  # ImportError

Upvotes: 0

Views: 365

Answers (1)

Davide

Reputation: 525

From the looks of it, you are using the correct environment for the workers (PYSPARK_PYTHON) but overriding the environment for the driver (PYSPARK_DRIVER_PYTHON), so the driver's Python does not see the package you want to import. The code in some_function is executed by the workers (never by the driver), which is why it can see the package, while the last import is executed by the driver and fails.
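A minimal sketch of what a client-mode setup usually looks like, assuming pendulum is also installed in the local Python on the submitting machine, and with app.py as a placeholder for your script name. The archive is only extracted inside the YARN containers, so the driver needs a local interpreter that already has the package, while the executors use the shipped environment:

# In client mode the driver runs on the submitting host, where the
# archive is NOT extracted; point it at a local interpreter that has
# pendulum installed. (Setting PYSPARK_DRIVER_PYTHON to the literal
# string "which python" does not run that command; it is just an
# invalid interpreter path.)
export PYSPARK_DRIVER_PYTHON=python
# Executors run in YARN containers where the archive is extracted.
export PYSPARK_PYTHON=./environment/bin/python

spark-submit --deploy-mode client \
    --archives path_to/environment.tar.gz#environment \
    app.py

With the driver's interpreter fixed, the top-level import pendulum should succeed as well. Alternatively, the driver interpreter can be set with --conf spark.pyspark.driver.python instead of the environment variable.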

Upvotes: 0
