Reputation: 992
I have set PYSPARK_PYTHON=python3 in spark-env.sh using Ambari, and when I run 'pyspark' on the command line, it uses Python 3.4.3. However, when I submit a job in YARN cluster mode, it runs with Python 2.7.9. How do I make it use Python 3?
Upvotes: 0
Views: 597
Reputation: 29227
Explanation: In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.
RDD transformations in Python are mapped to transformations on PythonRDD objects in Java. On remote worker machines, PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed.
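Because the executors launch their own Python subprocesses, the interpreter they run can differ from the one running the driver. A minimal diagnostic sketch that prints both (the app name and partition count are arbitrary):
import sys
from pyspark import SparkContext

sc = SparkContext(appName="python-version-check")
print("driver:    " + sys.version)
# each task runs inside the Python worker subprocess on an executor
versions = sc.parallelize(range(4), 4).map(lambda _: sys.version).collect()
print("executors: " + ", ".join(set(versions)))
sc.stop()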
Solution:
Just before creating a Spark session, set the interpreter through environment variables from within Python, as in the example snippet below:
import os
import sys

from pyspark.sql import SparkSession

# Point both the driver and the executors at the interpreter running this script
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# Initialize Spark session
spark = SparkSession.builder \
    .appName("String to CSV") \
    .getOrCreate()
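In YARN cluster mode the driver itself is started inside the application master, so it can also help to fix the interpreter at submit time. A hedged equivalent on the command line, assuming the script is called job.py and python3 is on the PATH of every node (both names are illustrative):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3 \
  --conf spark.executorEnv.PYSPARK_PYTHON=python3 \
  job.py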
Upvotes: 0
Reputation: 3554
You need to give the full path of python3, for example:
import os

# Set the full interpreter path in this process's environment before the SparkContext is created
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python3'
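On Spark 2.1 and later the same full path can also be supplied as configuration properties rather than environment variables; an illustrative submit line (job.py is a placeholder):
spark-submit \
  --conf spark.pyspark.driver.python=/usr/local/bin/python3 \
  --conf spark.pyspark.python=/usr/local/bin/python3 \
  job.py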
Upvotes: 0