Reputation: 53
The question I am trying to answer is:
Create RDD
Use map to create an RDD of NumPy arrays built from the specified columns. Name the RDD Rows.
My code:
Rows = df.select(col).rdd.map(make_array)
When I run this, I get a strange error that says: Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I know I am working in a Python 3.6 environment, and I am not sure whether this specific line of code is what triggers the error. What do you think?
Just to note, this isn't the first line of code in this Jupyter notebook. If you need more information, please let me know and I will provide it. I can't understand why this is happening.
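For reference, here is a minimal, self-contained version of what I am attempting; make_array below is only a stand-in, since the real helper comes from the assignment:

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["x", "y"])
cols = ["x", "y"]

def make_array(row):
    # Stand-in helper: turn a Row of the selected columns into a NumPy array
    return np.array([row[c] for c in cols])

Rows = df.select(*cols).rdd.map(make_array)
# map is lazy; the worker/driver mismatch only surfaces once an action
# such as take() forces the workers to execute Python code
print(Rows.take(2))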
Upvotes: 2
Views: 4776
Reputation: 91
In a notebook recently, I had to add these lines at the beginning to sync the Python versions:
import os
import sys

# Point both the workers and the driver at the interpreter running this notebook
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
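One caveat: these variables are only read when the SparkContext starts, so set them before creating the session. A minimal sketch of the ordering, assuming local mode:

from pyspark.sql import SparkSession

# getOrCreate() reuses an existing context, so restart the kernel first
# if Spark was already started with the wrong interpreter
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.sparkContext.pythonVer)  # should now match the driver's Python version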
Upvotes: 2
Reputation: 2477
Your slaves and your driver are not using the same version of Python, which triggers this error whenever Spark runs Python code on the workers.
Make sure you have Python 3.6 installed on your slaves, then (on Linux) modify your spark/conf/spark-env.sh
file to add PYSPARK_PYTHON=/usr/local/bin/python3.6
(if that is where the Python 3.6 binary lives on your slaves; note that PYSPARK_PYTHON must point to the interpreter executable, not to a lib directory)
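Once the workers are configured, you can verify from the driver which interpreter the executors actually run. A small diagnostic sketch, assuming a running SparkSession:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print("driver:", sys.version_info[:2])

# Each task reports the Python version of the worker that executed it;
# if the versions differ, this job itself fails with the mismatch error
workers = (spark.sparkContext
           .parallelize(range(2), 2)
           .map(lambda _: sys.version_info[:2])
           .distinct()
           .collect())
print("workers:", workers)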
Upvotes: 2