Kevin

Reputation: 3431

How do I set the driver's python version in spark?

I'm using spark 1.4.0-rc2 so I can use python 3 with spark. If I add export PYSPARK_PYTHON=python3 to my .bashrc file, I can run spark interactively with python 3. However, if I want to run a standalone program in local mode, I get an error:

Exception: Python in worker has different version 3.4 than that in driver 2.7, PySpark cannot run with different minor versions

How can I specify the version of python for the driver? Setting export PYSPARK_DRIVER_PYTHON=python3 didn't work.

Upvotes: 99

Views: 209981

Answers (20)

Owais Tahir

Reputation: 41

Here is a solution. The error I was getting said something similar to:

my worker node has python3.10 while the driver has python3.11

What I had to do was to navigate to /usr/bin/ and execute ls -l. I got the following output:

lrwxrwxrwx 1 root root    7 Feb 12 19:50 python -> python3
lrwxrwxrwx 1 root root    9 Oct 11  2021 python2 -> python2.7
-rwxr-xr-x 1 root root  14K Oct 11  2021 python2.7
-rwxr-xr-x 1 root root 1.7K Oct 11  2021 python2.7-config
lrwxrwxrwx 1 root root   16 Oct 11  2021 python2-config -> python2.7-config
lrwxrwxrwx 1 root root   10 Feb 12 19:50 python3 -> python3.10
-rwxr-xr-x 1 root root  15K Feb 12 19:50 python3.11
-rwxr-xr-x 1 root root 3.2K Feb 12 19:50 python3.11-config
lrwxrwxrwx 1 root root   17 Feb 12 19:50 python3-config -> python3.11-config
-rwxr-xr-x 1 root root 2.5K Apr  8  2023 python-argcomplete-check-easy-install-script
-rwxr-xr-x 1 root root  383 Apr  8  2023 python-argcomplete-tcsh
lrwxrwxrwx 1 root root   14 Feb 12 19:50 python-config -> python3-config

Notice the line lrwxrwxrwx 1 root root 10 Feb 12 19:50 python3 -> python3.10

I realized that my python3 was pointing to python3.10 even though I had python3.11 installed. If that's the case for you, then the following fix should work:

  1. Locate Python 3.11: First, ensure that Python 3.11 is installed on your system. You can usually find it in /usr/bin/ or /usr/local/bin/. Let's assume it's in /usr/bin/python3.11.
  2. Update the Symbolic Link: Open a terminal and run:
    sudo ln -sf /usr/bin/python3.11 /usr/bin/python3
  3. Verify the Update: In the terminal, run:
    ls -l /usr/bin/python3

    This should show something like:
    lrwxrwxrwx 1 root root XX XXX XX:XX /usr/bin/python3 -> /usr/bin/python3.11

    This indicates that python3 now points to Python 3.11.

Now, when you run your PySpark code, it should use Python 3.11 on both the worker nodes and the driver.
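
If you would rather not repoint the system-wide python3 symlink, an alternative sketch (assuming Python 3.11 lives at /usr/bin/python3.11, as above) is to point Spark at the interpreter explicitly:

# check what "python3" currently resolves to
readlink -f "$(command -v python3)"

# make the workers and the driver use the same interpreter
export PYSPARK_PYTHON=/usr/bin/python3.11
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.11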

Upvotes: 0

Devbrat Shukla

Reputation: 524

I was facing the same issue while working with PyCharm and Spark. To fix this error, I followed the steps below.

  1. Click on the Run option in the PyCharm menu bar.

  2. Click on the Edit Configurations option.

  3. Click on Environment Variables and add the lines below, adjusted to your locations.

    PYSPARK_PYTHON=/usr/bin/python3.6;
    PYSPARK_DRIVER_PYTHON=/usr/bin/python3.6;
    SPARK_HOME=/home/xxxxxx/Desktop/xxxx/spark
    

Upvotes: 1

chadmc

Reputation: 81

I had the same problem, just forgot to activate my virtual environment.

Upvotes: 1

fccoelho

Reputation: 6204

Setting both PYSPARK_PYTHON=python3 and PYSPARK_DRIVER_PYTHON=python3 works for me.

I did this using export in my .bashrc. In the end, these are the variables I create:

export SPARK_HOME="$HOME/Downloads/spark-1.4.0-bin-hadoop2.4"
export IPYTHON=1
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
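
After editing .bashrc, a quick sanity check (using the variables set above) is to reload it and confirm what the worker interpreter resolves to:

source ~/.bashrc
echo "$PYSPARK_PYTHON"
"$PYSPARK_PYTHON" --version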

I also followed this tutorial to make it work from within an IPython3 notebook: http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/

Upvotes: 100

Juan José

Reputation: 1

If you are working on a Mac, use the following commands:

export SPARK_HOME=`brew info apache-spark | grep /usr | tail -n 1 | cut -f 1 -d " "`/libexec
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH

export HADOOP_HOME=`brew info hadoop | grep /usr | head -n 1 | cut -f 1 -d " "`/libexec
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
export PYSPARK_PYTHON=python3

If you are using another OS, check the following link: https://github.com/GalvanizeDataScience/spark-install

Upvotes: 0

Hillary Murefu

Reputation: 96

Run:

ls -l /usr/local/bin/python*

The first row in this example shows the python3 symlink. To set it as the default python symlink run the following:

ln -s -f /usr/local/bin/python3 /usr/local/bin/python

then reload your shell.

Upvotes: 0

Muser

Reputation: 603

In my case (Ubuntu 18.04), I ran this command in a terminal:

sudo vim ~/.bashrc

and then edited SPARK_HOME as follows:

export SPARK_HOME=/home/muser/programs/anaconda2019/lib/python3.7/site-packages/pyspark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

By doing so, my SPARK_HOME refers to the pyspark package I installed in site-packages.
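
If you are not sure where pip or conda put the pyspark package, this one-liner prints its location (a sketch; it assumes pyspark is importable by that interpreter):

python3 -c "import pyspark, os; print(os.path.dirname(pyspark.__file__))"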

To learn how to use vim, go to this link.

Upvotes: 0

Justin Varughese

Reputation: 21

Please look at the snippet below:

#setting environment variables for PySpark on Linux/Ubuntu
#go to /usr/local/spark/conf
#create a new file named spark-env.sh and copy the contents of spark-env.sh.template into it
#then add the lines below, with the path to your Python interpreter

PYSPARK_PYTHON="/usr/bin/python3"
PYSPARK_DRIVER_PYTHON="/usr/bin/python3"
PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
#I was running Python 3.6; run 'which python' in a terminal to find the path of your Python
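
The comments above translate to roughly the following commands (a sketch; the paths are the ones assumed in this answer):

cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
# then append the three variable lines above to spark-env.sh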

Upvotes: 1

Holden

Reputation: 7452

You need to make sure the standalone project you're launching is launched with Python 3. If you are submitting your standalone program through spark-submit then it should work fine, but if you are launching it with python make sure you use python3 to start your app.

Also, make sure you have set your env variables in ./conf/spark-env.sh (if it doesn't exist, you can use spark-env.sh.template as a base).
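
For example (a sketch; my_app.py stands in for your standalone program):

# spark-submit picks up PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON from the environment
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=python3 spark-submit my_app.py

# if you launch the script directly, start it with the same interpreter the workers use
python3 my_app.py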

Upvotes: 34

Rizvi Hasan

Reputation: 712

I got the same issue with standalone Spark on Windows. My fix was as follows: I had my environment variables set as below:

PYSPARK_SUBMIT_ARGS="pyspark-shell"
PYSPARK_DRIVER_PYTHON=jupyter
PYSPARK_DRIVER_PYTHON_OPTS='notebook' pyspark

With these settings I executed an action in PySpark and got the following exception:

Python in worker has different version 3.6 than that in driver 3.5, PySpark cannot run with different minor versions.
Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

To check which Python version my Spark worker was using, I ran the following at the command prompt:

python --version
Python 3.6.3

This showed me Python 3.6.3, so clearly my Spark worker is using the system Python, which is v3.6.3.

Since I had set my Spark driver to run Jupyter via PYSPARK_DRIVER_PYTHON=jupyter, I needed to check which Python version Jupyter was using.

To do this, open the Anaconda Prompt and run:

python --version
Python 3.5.X :: Anaconda, Inc.

This told me the Jupyter Python is v3.5.x. You can also check this version in any notebook (Help -> About).

Now I needed to update the Jupyter Python to version 3.6.3. To do that, open the Anaconda Prompt and run:

conda search python

This will give you a list of the Python versions available in Anaconda. Install the one you need with:

conda install python=3.6.3

Now that both Python installations are the same version, 3.6.3, Spark should not complain, and it didn't when I ran an action on the Spark driver. The exception was gone. Happy coding ...
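
An alternative that avoids changing the base Anaconda environment (a sketch; the environment name pyspark363 is made up) is a dedicated conda environment whose Python matches the workers:

conda create -n pyspark363 python=3.6.3 jupyter
conda activate pyspark363
# start jupyter from this environment so the driver and the workers agree on 3.6.3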

Upvotes: 3

dbustosp

Reputation: 4458

I just faced the same issue; these are the steps I followed to set the Python version. I wanted to run my PySpark jobs with Python 2.7 instead of 2.6.

  1. Go to the folder that $SPARK_HOME points to (in my case it is /home/cloudera/spark-2.1.0-bin-hadoop2.7/).

  2. Under the conf folder there is a file called spark-env.sh. If you only have a file called spark-env.sh.template, copy it to a new file called spark-env.sh.

  3. Edit the file and add the following three lines:

    export PYSPARK_PYTHON=/usr/local/bin/python2.7

    export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7

    export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/usr/local/bin/python2.7"

  4. Save it and launch your application again :)

This way, whenever you download a new standalone Spark version, you can set the Python version you want to run PySpark with.

Upvotes: 10

Grr

Reputation: 16079

Ran into this today at work. An admin thought it prudent to hard code Python 2.7 as the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON in $SPARK_HOME/conf/spark-env.sh. Needless to say this broke all of our jobs that utilize any other python versions or environments (which is > 90% of our jobs). @PhillipStich points out correctly that you may not always have write permissions for this file, as is our case. While setting the configuration in the spark-submit call is an option, another alternative (when running in yarn/cluster mode) is to set the SPARK_CONF_DIR environment variable to point to another configuration script. There you could set your PYSPARK_PYTHON and any other options you may need. A template can be found in the spark-env.sh source code on github.
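
A minimal sketch of that approach (all paths here are hypothetical):

# keep a private conf directory you are allowed to write to
mkdir -p ~/my-spark-conf
cp "$SPARK_HOME/conf/spark-env.sh.template" ~/my-spark-conf/spark-env.sh
echo 'export PYSPARK_PYTHON=/path/to/your/python' >> ~/my-spark-conf/spark-env.sh
# also copy any other conf files your jobs rely on (spark-defaults.conf, etc.)

export SPARK_CONF_DIR=~/my-spark-conf
spark-submit --master yarn my_app.py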

Upvotes: 0

Phillip Stich

Reputation: 459

If you're running Spark in a larger organization and are unable to update the /spark-env.sh file, exporting the environment variables may not work.

You can add the specific Spark settings through the --conf option when submitting the job at run time.

pyspark --master yarn --[other settings] \
  --conf "spark.pyspark.python=/your/python/loc/bin/python" \
  --conf "spark.pyspark.driver.python=/your/python/loc/bin/python"
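
The same configuration keys work with spark-submit, for example (my_app.py is a placeholder for your script):

spark-submit --master yarn \
  --conf "spark.pyspark.python=/your/python/loc/bin/python" \
  --conf "spark.pyspark.driver.python=/your/python/loc/bin/python" \
  my_app.py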

Upvotes: 13

William Lee

Reputation: 115

Error

"Exception: Python in worker has different version 2.6 than that in driver  2.7, PySpark cannot run with different minor versions". 

Fix (for Cloudera environment)

  • Edit this file: /opt/cloudera/parcels/cdh5.5.4.p0.9/lib/spark/conf/spark-env.sh

  • Add these lines:

    export PYSPARK_PYTHON=/usr/bin/python
    export PYSPARK_DRIVER_PYTHON=python
    

Upvotes: 0

Nikolay Bystritskiy

Reputation: 550

Helped in my case:

import os

os.environ["SPARK_HOME"] = "/usr/local/Cellar/apache-spark/1.5.1/"
os.environ["PYSPARK_PYTHON"]="/usr/local/bin/python3"

Upvotes: 31

Frank

Reputation: 825

I came across the same error message and tried the three approaches mentioned above. I list the results here as a complementary reference for others.

  1. Changing the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON values in spark-env.sh did not work for me.
  2. Changing the values inside the Python script with os.environ["PYSPARK_PYTHON"]="/usr/bin/python3.5" and os.environ["PYSPARK_DRIVER_PYTHON"]="/usr/bin/python3.5" did not work for me.
  3. Changing the values in ~/.bashrc works like a charm~

Upvotes: 7

Peter Pan

Reputation: 129

If you only want to change the Python version for the current task, you can use the following pyspark start command:

    PYSPARK_DRIVER_PYTHON=/home/user1/anaconda2/bin/python PYSPARK_PYTHON=/usr/local/anaconda2/bin/python pyspark --master ..

Upvotes: 1

George Fisher

Reputation: 3344

I am using the following environment

$ python --version; ipython --version; jupyter --version
Python 3.5.2+
5.3.0
5.0.0

and the following aliases work well for me

alias pyspark="PYSPARK_PYTHON=/usr/local/bin/python3 PYSPARK_DRIVER_PYTHON=ipython ~/spark-2.1.1-bin-hadoop2.7/bin/pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11"    
alias pysparknotebook="PYSPARK_PYTHON=/usr/bin/python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' ~/spark-2.1.1-bin-hadoop2.7/bin/pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11"

In the notebook, I set up the environment as follows

from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()

Upvotes: 0

Alex Punnen

Reputation: 6224

I was running it in IPython (as described in this link by Jacek Wasilewski) and was getting this exception. I added PYSPARK_PYTHON to the IPython kernel file, ran it with Jupyter Notebook, and it started working.

vi  ~/.ipython/kernels/pyspark/kernel.json

{
 "display_name": "pySpark (Spark 1.4.0)",
 "language": "python",
 "argv": [
  "/usr/bin/python2",
  "-m",
  "IPython.kernel",
  "--profile=pyspark",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/usr/local/spark-1.6.1-bin-hadoop2.6/",
  "PYTHONPATH": "/usr/local/spark-1.6.1-bin-hadoop2.6/python/:/usr/local/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip",
  "PYTHONSTARTUP": "/usr/local/spark-1.6.1-bin-hadoop2.6/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--master spark://127.0.0.1:7077 pyspark-shell",
  "PYSPARK_DRIVER_PYTHON": "ipython2",
  "PYSPARK_PYTHON": "python2"
 }
}
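
After editing the kernel file, jupyter kernelspec list shows which kernels Jupyter can actually see (depending on your Jupyter version, kernels under ~/.ipython/kernels may need to be copied into Jupyter's own kernels directory):

jupyter kernelspec list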

Upvotes: 3

James Clarke

Reputation: 151

You can specify the version of Python for the driver by setting the appropriate environment variables in the ./conf/spark-env.sh file. If it doesn't already exist, you can use the spark-env.sh.template file provided which also includes lots of other variables.

Here is a simple example of a spark-env.sh file to set the relevant Python environment variables:

#!/usr/bin/env bash

# This file is sourced when running various Spark programs.
export PYSPARK_PYTHON=/usr/bin/python3       
export PYSPARK_DRIVER_PYTHON=/usr/bin/ipython

In this case it sets the version of Python used by the workers/executors to Python 3 and the driver's Python to IPython, for a nicer shell to work in.

If you don't already have a spark-env.sh file, and don't need to set any other variables, this one should do what you want, assuming that paths to the relevant python binaries are correct (verify with which). I had a similar problem and this fixed it.
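
For example, a quick check that both interpreters exist at the expected paths before pointing Spark at them:

which python3
which ipython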

Upvotes: 15
