Learning Everyday

Reputation: 53

Python version different in worker and driver

The question I am trying to answer is:

Create RDD

Use map to create an RDD of NumPy arrays built from the specified columns. The RDD should be named Rows.

My code: Rows = df.select(col).rdd.map(make_array)
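For reference, a minimal self-contained version of this setup might look like the sketch below. The make_array helper and the col variable are reconstructed here as illustrative guesses, since their definitions are not shown in the question:

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame standing in for the real df
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["x", "y"])

# Guessed definitions: col is the list of column names to keep,
# and make_array turns each Row into a NumPy array
col = ["x", "y"]

def make_array(row):
    return np.array(row, dtype=float)

Rows = df.select(col).rdd.map(make_array)
print(Rows.collect())  # e.g. [array([1., 2.]), array([3., 4.])]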

After I type this, I get a strange error, which basically says: Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.


I know I am working in a Python 3.6 environment, so I am not sure whether this specific line of code is what triggers the error. What do you think?

Just to note, this isn't my first line of code in this Jupyter notebook. If you need more information, please let me know and I will provide it. I can't understand why this is happening.
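One way to pin down the mismatch is to ask both sides which interpreter they are running. A quick check along these lines (assuming the SparkSession already exists in the notebook) prints the driver's version directly and collects the workers' versions through a trivial job; if the versions really do differ, the collect() itself will fail with the same exception, which at least confirms where the mismatch is:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Interpreter used by the driver (this notebook)
print("driver :", sys.version_info[:3], sys.executable)

# Each task reports the interpreter its worker runs
workers = (sc.parallelize(range(4), 4)
             .map(lambda _: (sys.version_info[:3], sys.executable))
             .distinct()
             .collect())
print("workers:", workers)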

Upvotes: 2

Views: 4776

Answers (2)

rfs

Reputation: 91

In a recent notebook I had to add the following lines at the beginning to sync the Python versions:

import os
import sys

# Make both the workers and the driver use the interpreter
# running this notebook, so the versions always match
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
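Note that these assignments only take effect if they run before the SparkContext is created, since the worker interpreter is chosen when Spark launches. If a session is already running in the notebook, stop it (or restart the kernel) and re-run from the top.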

Upvotes: 2

Pierre Gourseaud

Reputation: 2477

Your slaves and your driver are not using the same version of Python, which will trigger this error any time you use Spark.

Make sure Python 3.6 is installed on your slaves, then (on Linux) edit your spark/conf/spark-env.sh file to add PYSPARK_PYTHON=/usr/local/bin/python3.6 (or wherever the Python 3.6 executable lives on your slaves; it must point to the interpreter binary, not a lib directory).
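For reference, the change would look something like this (the executable path is illustrative; adjust it to wherever Python 3.6 actually lives on your nodes):

# In spark/conf/spark-env.sh on every node
export PYSPARK_PYTHON=/usr/local/bin/python3.6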

Upvotes: 2
