mankand007

Reputation: 992

PySpark does not use Python 3 in Yarn cluster mode, even with PYSPARK_PYTHON=python3

I have set PYSPARK_PYTHON=python3 in spark-env.sh using Ambari, and when I run 'pyspark' on the command line, it uses Python 3.4.3. However, when I submit a job in Yarn cluster mode, it runs with Python 2.7.9. How do I make it use Python 3?

Upvotes: 0

Views: 597

Answers (2)

Ram Ghadiyaram

Reputation: 29227


Explanation: In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.

RDD transformations in Python are mapped to transformations on PythonRDD objects in Java. On remote worker machines, PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed.
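A quick way to observe this split (a minimal sketch of my own, not part of the original setup; the app name and the worker_python helper are arbitrary) is to compare the interpreter the driver reports with the one a task reports from inside an executor:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python-version-check").getOrCreate()
sc = spark.sparkContext

def worker_python(_):
    # Runs inside the Python subprocess launched on the worker
    import sys
    return sys.version

# Interpreter running the driver program
print("driver:  ", sys.version)
# Interpreter launched by PythonRDD on the executors
print("executor:", sc.parallelize([0], 1).map(worker_python).first())

If the executor line reports Python 2 while the driver reports Python 3, PYSPARK_PYTHON is not reaching the worker machines.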

Solution:

Just before creating the Spark session, set the environment variables from Python, as in the example snippet below:

import os
import sys

from pyspark.sql import SparkSession

# Make the executors and the driver use the same interpreter as this script
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# Initialize Spark session
spark = SparkSession.builder \
    .appName("String to CSV") \
    .getOrCreate()
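Note that sys.executable resolves to the interpreter path on whichever node runs the driver, so in Yarn cluster mode this assumes python3 is installed at the same path on every node. A fixed path works the same way (the location below is assumed; adjust it for your cluster):

import os

# Assumed install location; use a path that exists on all cluster nodes
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3'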

Upvotes: 0

abhiieor

Reputation: 3554

You need to give the full path of python3, like:

import subprocess

# Path assumed; point at wherever python3 is installed
subprocess.call(['export PYSPARK_PYTHON=/usr/local/bin/python3'], shell=True)

Upvotes: 0
