mt88

Reputation: 3015

PySpark: "Exception: Java gateway process exited before sending the driver its port number"

I'm trying to run PySpark on my MacBook Air. When I try starting it up, I get the error:

Exception: Java gateway process exited before sending the driver its port number

when sc = SparkContext() is being called upon startup. I have tried running the following commands:

./bin/pyspark
./bin/spark-shell
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"

to no avail. I have also looked here:

Spark + Python - Java gateway process exited before sending the driver its port number?

but the question has never been answered. How can I fix it?

Upvotes: 141

Views: 443316

Answers (30)

Ajeet Verma

Reputation: 3081

I was having the same problem and tried many different things, but the solution that worked for me was simply to install Java 8 (and uninstall any other Java versions from the machine).

Step 1: Install Java 8. Download Java 8 from the official download page; on Windows, choose the x64 installer.

Step 2: Set the JAVA_HOME environment variable.

for example,

set JAVA_HOME=C:\Program Files\Java\jdk1.8.0_351 

That's all. With these 2 simple steps, I was able to fix the problem.
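
If you launch PySpark from Python directly, here is a minimal sketch (assuming the JDK 8 path above; adjust it to your machine) for making sure the new JAVA_HOME is visible before the gateway starts:

import os

# Point PySpark at the JDK 8 installation before the Java gateway is launched
# (this is the example path from this answer; adjust it to your machine).
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_351"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("java-home-check").getOrCreate()
print(spark.version)
spark.stop()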

Upvotes: 1

answerzilla

Reputation: 201

I had the same error with PySpark, and setting JAVA_HOME to Java 11 worked for me (it was originally set to 16). I'm using macOS and PyCharm. You can check which Java your environment currently points to with echo $JAVA_HOME.

Below is what worked for me. On my Mac I used the following Homebrew command, but you can use a different method to install the desired Java version, depending on your OS.

# Install Java 11 (I believe 8 works too)
brew install openjdk@11

# Set JAVA_HOME by assigning the path where your Java is
export JAVA_HOME=/usr/local/opt/openjdk@11

Note: If you installed using Homebrew and need to find the location of the path, you can do brew --prefix openjdk@11 and it should return a path like this: /usr/local/opt/openjdk@11

At this point, I could run my PySpark program from the terminal - however, my IDE (PyCharm) still had the same error until I globally changed the JAVA_HOME variable.

To update the variable, first check whether you're using the Z shell (executable zsh) or Bash shell by running echo $SHELL on the command line. For Z shell, you'll edit the $HOME/.zshenv file and for Bash you'll edit the $HOME/.bash_profile file.

# Open the file
vim ~/.zshenv

# Or

vim ~/.bash_profile

# Once inside the file, set the variable with your Java path, then save and close the file
export JAVA_HOME=/usr/local/opt/openjdk@11

# Test if it was set successfully
echo $JAVA_HOME

Output:

/usr/local/opt/openjdk@11

After this step, I could run PySpark through my PyCharm IDE as well.
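
A quick way to confirm the IDE actually sees the updated variable is to print it from inside PyCharm (just a sketch; if it shows the old path or None, set JAVA_HOME in the run configuration instead of the shell profile):

import os

# Should print /usr/local/opt/openjdk@11 (or wherever your Java 11 lives).
print(os.environ.get("JAVA_HOME"))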

Upvotes: 10

Sahana M

Reputation: 635

After spending a good amount of time on this issue, I was able to solve it. I'm on macOS v10.15 (Catalina), working in PyCharm with an Anaconda environment.

Spark currently supports only Java 8. If you install Java through the command line, it will install the latest Java 10 (or later) by default, which causes all sorts of trouble. To solve this, follow the steps below:

1. Make sure you have Homebrew, else install Homebrew
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

2. Install Xcode
xcode-select --install

3. Install Java 8 through the official website (not through the terminal)
https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html

4. Install Apache-Spark
 brew install apache-spark

5. Install Pyspark and Findspark (if you have anaconda)
conda install -c conda-forge findspark
conda install -c conda-forge/label/gcc7 findspark
conda install -c conda-forge pyspark

Voilà! This should let you run PySpark without any issues.
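
If you want a quick check that these pieces are wired together, here is a minimal sketch (assuming SPARK_HOME points at the brew-installed Spark; findspark.init() can also be given the path explicitly):

import findspark

# findspark adds Spark's python/ and py4j libraries to sys.path.
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("findspark-check").getOrCreate()
print(spark.version)
spark.stop()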

Upvotes: 3

Shritam Kumar Mund

Reputation: 561

Step 1:

Check the Java version from the terminal.

java -version

If you see bash: java: command not found, it means you don't have Java installed on your system.

Step 2:

Install Java using the following command,

sudo apt-get install default-jdk

Step 3:

Now check the Java version again; you'll see that it has been installed.

java -version

Result:

openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)

Step 4:

Now run the PySpark code, and you'll never see such an error again.
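
For a quick smoke test, something like this sketch should now start the gateway without the exception:

from pyspark.sql import SparkSession

# If Java is installed correctly, building the session no longer raises
# "Java gateway process exited before sending the driver its port number".
spark = SparkSession.builder.appName("gateway-check").getOrCreate()
print(spark.range(5).count())  # expected: 5
spark.stop()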

Upvotes: 1

Nisan Chhetri

Reputation: 347

You can simply run the following command in the terminal.

sudo apt-get install default-jdk

Upvotes: 0

kennyut

Reputation: 3831

I had the same issue once when I brought up Spark in a Docker container. It turned out I had set the wrong permissions on the /tmp folder.

If Spark doesn't have write permission on /tmp, it will cause this issue too.
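
A quick way to check this from Python (a sketch; Spark's scratch directory defaults to /tmp unless spark.local.dir is overridden):

import os
import tempfile

# Verify the current user can write to Spark's default scratch directory.
print(os.access("/tmp", os.W_OK))

# This raises PermissionError if /tmp is not writable.
with tempfile.NamedTemporaryFile(dir="/tmp") as f:
    f.write(b"ok")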

Upvotes: 0

Ray

Reputation: 133

I ran into this problem, and in my case it was not due to the JAVA_HOME setting. I assume you are using Windows, with Anaconda as your Python tooling. Please check whether you can use a command prompt: I could not run Spark because cmd kept crashing. After fixing that, Spark worked well on my PC.

Upvotes: 1

Shashi Kumar Singh

Reputation: 121

If you are using Jupyter Notebook from a Windows machine.

Just use the following code:

spark =SparkSession.builder.appName('myapp').getOrCreate

Don't use it like:

spark =SparkSession.builder.appName('myapp').getOrCreate()

Upvotes: -5

s510

Reputation: 2822

This error indicates a mismatch between the PySpark and Java versions: the two are not compatible. See the compatibility matrix below.

PySpark Version   Min Java Version
-------------------------------------
2.0.x - 2.2.x     Java 7
2.3.x - 2.4.x     Java 8
3.0.x - 3.1.x     Java 8
3.2.x             Java 11

Check your Java version. If it is 17, you need at least PySpark 3.3, so upgrade PySpark.
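
A quick sketch to see both versions side by side (note that java -version prints its banner to stderr, not stdout):

import subprocess

import pyspark

print("PySpark:", pyspark.__version__)

# java -version writes its output to stderr.
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print("Java:", result.stderr.strip())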

Upvotes: 1

mon

Reputation: 22356

Background

The error "Java gateway process exited before sending the driver its port number" occurs in SPARK_HOME/python/lib/pyspark.zip/pyspark/java_gateway.py.

    if not on_windows:
        # Don't send Ctrl + C / SIGINT to the Java gateway:
        def preexec_func():
            signal.signal(signal.SIGINT, signal.SIG_IGN)

        popen_kwargs["preexec_fn"] = preexec_func
        proc = Popen(command, **popen_kwargs)
    else:
        # preexec_fn is not supported on Windows
        proc = Popen(command, **popen_kwargs)

    # Wait for the file to appear, or for the process
    # to exit, whichever happens first.
    while not proc.poll() and not os.path.isfile(conn_info_file):
        time.sleep(0.1)

    if not os.path.isfile(conn_info_file):
        raise RuntimeError("Java gateway process exited before sending its port number")      # <-----

The script below reproduces this PySpark process invocation to test whether PySpark can be started.

Test

Run the code below to make sure PySpark can be invoked. It is specific to Spark installed with Homebrew on Apple silicon, but the idea and approach are applicable to other platforms. The Spark version here is 3.3.1; change the versions (and paths) according to your setup.

import os
import shutil
import signal
import sys
import tempfile
import time
from subprocess import Popen, PIPE

# --------------------------------------------------------------------------------
# Constant
# --------------------------------------------------------------------------------
SPARK_HOME = "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec"
JAVA_HOME = '/opt/homebrew/opt/openjdk'

# --------------------------------------------------------------------------------
# Environment Variables
# NOTE:
# SPARK_HOME must be set to /opt/homebrew/Cellar/apache-spark/3.3.1/libexec,
# NOT /opt/homebrew/Cellar/apache-spark/3.3.1.
# Otherwise "Java gateway process exited before sending its port number" is raised in java_gateway.py
# --------------------------------------------------------------------------------
os.environ['SPARK_HOME'] = SPARK_HOME
os.environ['JAVA_HOME'] = JAVA_HOME
sys.path.extend([
    f"{SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip",
    f"{SPARK_HOME}/python/lib/pyspark.zip",
])


# --------------------------------------------------------------------------------
# PySpark Modules
# --------------------------------------------------------------------------------
from pyspark.serializers import read_int, UTF8Deserializer


# --------------------------------------------------------------------------------
#
# --------------------------------------------------------------------------------
def preexec_func():
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def run_pyspark():
    """
    """
    pyspark_command = [f'{SPARK_HOME}/bin/spark-submit', '--master', 'local[*]', 'pyspark-shell']

    # Create a temporary directory where the gateway server should write the connection
    # information.
    proc = None
    conn_info_dir = tempfile.mkdtemp()
    try:
        fd, conn_info_file = tempfile.mkstemp(dir=conn_info_dir)
        os.close(fd)
        os.unlink(conn_info_file)

        env = dict(os.environ)
        env["_PYSPARK_DRIVER_CONN_INFO_PATH"] = conn_info_file

        # Launch the Java gateway.
        popen_kwargs = {"stdin": PIPE, "env": env}
        # We open a pipe to standard input, so that the Java gateway can die when the pipe is broken
        # We always set the necessary environment variables.

        print(f"\nrun pyspark command line {pyspark_command}")
        popen_kwargs["preexec_fn"] = preexec_func
        proc = Popen(pyspark_command, **popen_kwargs)

        # Wait for the file to appear, or for the process to exit, whichever happens first.
        count: int = 5
        while not proc.poll() and not os.path.isfile(conn_info_file):
            print("waiting for PySpark to start...")
            count -= 1
            if count < 0:
                break

            time.sleep(1)

        if not os.path.isfile(conn_info_file):
            raise RuntimeError("Java gateway process exited before sending its port number")

        with open(conn_info_file, "rb") as info:
            gateway_port = read_int(info)
            gateway_secret = UTF8Deserializer().loads(info)

        out, err = proc.communicate()
        print("-"*80)
        print(f"PySpark started with pid {proc.pid}")
        print(f"spark process port {gateway_port}")
    finally:
        shutil.rmtree(conn_info_dir)
        if proc:
            proc.kill()


def test():
    run_pyspark()


if __name__ == "__main__":
    test()

Environment

  • macOS v13.0.1 (Ventura)
  • Python 3.9.13
  • PySpark 3.3.1
  • OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
  • Scala code runner version 3.2.2

JDK 8, Scala, and Spark have been installed with brew.

brew install --cask adoptopenjdk8
brew install scala
brew install apache-spark

The Java process is allowed to receive incoming connections under System Preferences > Security > Firewall.

References

Ensure the SPARK_HOME environment variable points to the directory where Spark has been extracted (or installed). Update the PYTHONPATH environment variable so that it can find PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:
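
(A sketch, assuming the Homebrew layout used earlier in this answer; adjust the Spark and Py4J versions to match your installation.)

import os
import sys

SPARK_HOME = "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec"
os.environ["SPARK_HOME"] = SPARK_HOME

# Equivalent of extending PYTHONPATH: make the bundled PySpark and Py4J importable.
sys.path.extend([
    f"{SPARK_HOME}/python/lib/pyspark.zip",
    f"{SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip",
])

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()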

Upvotes: 1

CAV

Reputation: 169

There are many valuable hints here; however, none solved my problem completely, so I will show the procedure that worked for me in an Anaconda Jupyter Notebook on Windows:

  • Download and install Java and PySpark in directories without blank spaces.
  • [maybe unnecessary] In the Anaconda prompt, type where conda and where python and add the paths of the .exe files' directories to your Path variable using the Windows environment variables tool. Also add the variables JAVA_HOME and SPARK_HOME there with their corresponding paths.
  • Even doing so, I had to set these variables manually from within the Notebook along with PYSPARK_SUBMIT_ARGS (use your own paths for SPARK_HOME and JAVA_HOME):

import os
os.environ["SPARK_HOME"] = r"C:\Spark\spark-3.2.0-bin-hadoop3.2"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[3] pyspark-shell"
os.environ["JAVA_HOME"] = r"C:\Java\jre1.8.0_311"

  • Install findspark from the notebook with !pip install findspark.

  • Run import findspark and findspark.init()

  • Run from pyspark.sql import SparkSession and spark = SparkSession.builder.getOrCreate()

Some useful links:

https://towardsdatascience.com/installing-apache-pyspark-on-windows-10-f5f0c506bea1

https://sparkbyexamples.com/pyspark/pyspark-exception-java-gateway-process-exited-before-sending-the-driver-its-port-number/

https://www.datacamp.com/community/tutorials/installing-anaconda-windows

Upvotes: 6

archit jain

Reputation: 319

The error usually occurs when your system doesn't have Java installed.

Check whether you have Java installed: open the terminal and run java --version.

It's always advisable to use Homebrew for installing packages: brew install openjdk@11 installs Java.

Now that you have Java installed, set the path globally, depending on the shell you use (Z shell or Bash):

  1. Cmd + Shift + H: go to your home folder
  2. Cmd + Shift + .: show the hidden files (.zshenv or .bash_profile), add export JAVA_HOME=/usr/local/opt/openjdk@11 to the appropriate file, and save it

Upvotes: 0

kitokid

Reputation: 3117

I will repost how I solved it here just for future reference.

How I solved my similar problem

Prerequisite:

  1. anaconda already installed
  2. Spark already installed (https://spark.apache.org/downloads.html)
  3. pyspark already installed (https://anaconda.org/conda-forge/pyspark)

Steps I did (note: set the folder paths according to your system):

  1. Set SPARK_HOME to 'C:\spark\spark-3.0.1-bin-hadoop2.7'
  2. Set HADOOP_HOME to 'C:\spark\spark-3.0.1-bin-hadoop2.7'
  3. Set PYSPARK_DRIVER_PYTHON to 'jupyter'
  4. Set PYSPARK_DRIVER_PYTHON_OPTS to 'notebook'
  5. Add 'C:\spark\spark-3.0.1-bin-hadoop2.7\bin' to the PATH system variable.
  6. Move the Java installation folder directly under C: (previously Java was installed under Program Files, so I re-installed it directly under C:)
  7. So my JAVA_HOME becomes 'C:\java\jdk1.8.0_271'

Now it works!

Upvotes: 10

muzamil

Reputation: 306

I was getting this error when I was using the 32-bit JDK 1.8; switching to 64-bit worked for me.

I was getting the error because 32-bit Java could not allocate the more than 3 GB of heap memory required by the Spark driver (16 GB):

builder = SparkSession.builder \
        .appName("Spark NLP") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "1000M") \
        .config("spark.driver.maxResultSize", "0")

I tested setting this to 2G and it worked with the 32-bit JDK as well.

Upvotes: 3

Artyom Rebrov

Reputation: 691

I had the same issue when trying to run a PySpark job triggered from Airflow with a remote spark.driver.host. The cause of the issue in my case was:

Exception: Java gateway process exited before sending the driver its port number

...

Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

Fixed by adding exports:

export HADOOP_CONF_DIR=/etc/hadoop/conf

And the same environment variable added in the pyspark script:

import os
os.environ["HADOOP_CONF_DIR"] = '/etc/hadoop/conf'

Upvotes: 0

Arjjun

Reputation: 1309

This usually happens if you do not have Java installed on your machine.

Go to the command prompt and check your Java version by typing java -version.

You should get output something like this:

java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)

If not, go to Oracle and download the JDK. Check this video on how to download Java and add it to the build path.

https://www.youtube.com/watch?v=f7rT0h1Q5Wo

Upvotes: 1

Tarun Reddy

Reputation: 2833

The error occurred because Java is not installed on the machine. Spark is developed in Scala, which runs on the JVM.

Install Java and then execute the PySpark statements; it will work.

Upvotes: 1

user2314737

Reputation: 29407

I had this error message running PySpark on Ubuntu and got rid of it by installing the openjdk-8-jdk package.

from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("MyApp").setMaster("local"))
^^^ error

Install Open JDK 8:

apt-get install openjdk-8-jdk-headless -qq    

On MacOS

The same happened on macOS. I typed in a terminal:

$ java -version
No Java runtime present, requesting install. 

I was prompted to install Java from Oracle's download site, chose the macOS installer, clicked on jdk-13.0.2_osx-x64_bin.dmg, and after that checked that Java was installed:

$ java -version
java version "13.0.2" 2020-01-14

EDIT To install JDK 8 you need to go to https://www.oracle.com/java/technologies/javase-jdk8-downloads.html (login required)

After that I was able to start a Spark context with pyspark.

Checking if it works

In Python:

from pyspark import SparkContext 
sc = SparkContext.getOrCreate() 

# check that it really works by running a job
# example from http://spark.apache.org/docs/latest/rdd-programming-guide.html#parallelized-collections
data = range(10000) 
distData = sc.parallelize(data)
distData.filter(lambda x: not x&1).take(10)
# Out: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

Note that you might need to set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, and they have to be the same Python version as the Python (or IPython) you're using to run pyspark (the driver).
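
A sketch of pinning both variables to the interpreter you are already running, so the driver and worker Python versions cannot diverge:

import os
import sys

# Use the interpreter that runs this script for both the driver and the workers.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark import SparkContext

sc = SparkContext.getOrCreate()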

Upvotes: 30

Marcelo Tournier

Reputation: 119

Spark is very picky about the Java version you use. It is highly recommended that you use Java 1.8 (the open-source AdoptOpenJDK 8 works well too). After installing it, set JAVA_HOME in your Bash variables if you use Mac/Linux:

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

export PATH=$JAVA_HOME/bin:$PATH

Upvotes: 5

abhishek kumar

Reputation: 379

I got this error fixed by using the code below. I had already set up SPARK_HOME, though. You may follow the simple steps from the eproblems website.

spark_home = os.environ.get('SPARK_HOME', None)
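
On its own this line only reads the variable; here is a sketch of how it is typically combined with findspark (an assumption on my part, with findspark installed and SPARK_HOME already set):

import os

import findspark

# Hand the Spark installation path to findspark, which wires up sys.path
# so that pyspark can be imported.
spark_home = os.environ.get('SPARK_HOME', None)
findspark.init(spark_home)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()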

Upvotes: 0

ZhangXu

Reputation: 91

There are many possible reasons for this error. Mine was that the version of the pyspark package was incompatible with the installed Spark: the pyspark version was 2.4.0, but the Spark installation was 2.2.0. Python therefore always failed when starting the Spark process, and Spark could not report its port back to Python, so the error was "Pyspark: Exception: Java gateway process exited before sending the driver its port number".

I suggest diving into the source code to find the real reason when this error happens.
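
A quick sketch for comparing the pip-installed pyspark package with the Spark distribution that SPARK_HOME points to (assuming both are installed; spark-submit prints its version banner to stderr):

import os
import subprocess

import pyspark

print("pyspark package:", pyspark.__version__)

spark_submit = os.path.join(os.environ["SPARK_HOME"], "bin", "spark-submit")
banner = subprocess.run([spark_submit, "--version"],
                        capture_output=True, text=True).stderr
print(banner)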

Upvotes: 0

Ran Feldesh

Reputation: 1179

For Linux (Ubuntu 18.04) with a JAVA_HOME issue, the key is to point it to the master folder:

  1. Set Java 8 as the default with sudo update-alternatives --config java. If Java 8 is not installed, install it with sudo apt install openjdk-8-jdk.
  2. Set the JAVA_HOME environment variable to the master Java 8 folder. The location is given by the first command above, after removing jre/bin/java. Namely: export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/". If done on the command line, this is relevant only for the current session (see the export command on Linux). To verify: echo $JAVA_HOME.
  3. To have this set permanently, add the export line above to a file that runs before you start your IDE/Jupyter/Python interpreter, for example .bashrc, which is loaded when Bash is started interactively.

Upvotes: 2

A known

Reputation: 41

I had the same exception and tried everything, setting and resetting all the environment variables. In the end the issue drilled down to a space in the appName property of the Spark session, that is, SparkSession.builder.appName("StreamingDemo").getOrCreate(). Immediately after removing the space from the string given to the appName property, it was resolved. I was using pyspark 2.7 with Eclipse on Windows 10. It worked for me.

Upvotes: 1

hayj

Reputation: 1263

In my case it was because I wrote SPARK_DRIVER_MEMORY=10 instead of SPARK_DRIVER_MEMORY=10g in spark-env.sh

Upvotes: 1

shihs

Reputation: 356

I use Mac OS. I fixed the problem!

Below is how I fixed it.

JDK 8 seems to work fine. (https://github.com/jupyter/jupyter/issues/248)

So I checked my JDKs in /Library/Java/JavaVirtualMachines; I only had jdk-11.jdk in this path.

I downloaded JDK 8 (I followed the link), which is:

brew tap caskroom/versions
brew cask install java8

After this, I added

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home
export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"

to the ~/.bash_profile file. (You should check your jdk1.8 file name.)

It works now! Hope this helps :)

Upvotes: 12

Nate Busa

Reputation: 1650

If you are trying to run Spark without the Hadoop binaries, you might encounter the above-mentioned error. One solution is to:

1) Download Hadoop separately.
2) Add Hadoop to your PATH.
3) Add the Hadoop classpath to your Spark install.

The first two steps are trivial; the last step is best done by adding the following to $SPARK_HOME/conf/spark-env.sh on each Spark node (master and workers):

### in conf/spark-env.sh ###

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

For more information, also check: https://spark.apache.org/docs/latest/hadoop-provided.html

Upvotes: 3

Steven

Reputation: 2133

Make sure that both your Java directory (as found in your path) AND your Python interpreter reside in directories with no spaces in them. These were the cause of my problem.

Upvotes: 0

noiivice

Reputation: 400

This is an old thread but I'm adding my solution for those who use mac.

The issue was with the JAVA_HOME. You have to include this in your .bash_profile.

Check java -version. If you downloaded the latest Java but it doesn't show up as the latest version, then you know that the path is wrong. Normally, the default path is export JAVA_HOME=/usr/bin/java.

So try changing the path to: /Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java

Alternatively, you could also download the latest JDK from https://www.oracle.com/technetwork/java/javase/downloads/index.html, and this will automatically replace /usr/bin/java with the latest version. You can confirm this by running java -version again.

Then that should work.

Upvotes: 0

Yuuura87

Reputation: 1

For me, the answer was to add two 'Content Roots' in 'File' -> 'Project Structure' -> 'Modules' (in IntelliJ):

  1. YourPath\spark-2.2.1-bin-hadoop2.7\python
  2. YourPath\spark-2.2.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip

Upvotes: 0

Joon

Reputation: 63

I had the same error running PySpark in PyCharm. I solved the problem by adding JAVA_HOME to PyCharm's environment variables.

Upvotes: 1
