Reputation: 3015
I'm trying to run PySpark on my MacBook Air. When I try starting it up, I get the error:
Exception: Java gateway process exited before sending the driver its port number
when sc = SparkContext() is being called upon startup. I have tried running the following commands:
./bin/pyspark
./bin/spark-shell
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
to no avail. I have also looked here:
Spark + Python - Java gateway process exited before sending the driver its port number?
but the question has never been answered. How can I fix it?
Upvotes: 141
Views: 443316
Reputation: 3081
I was having the same problem and tried many different things, but the solution that worked for me was to simply install Java 8 (and uninstall any other version of Java, if any, on your machine).
Step 1: Install Java 8. Download Java 8 here.
Note: download the x64 installer for Windows.
Step 2: Set the JAVA_HOME environment variable.
for example,
set JAVA_HOME=C:\Program Files\Java\jdk1.8.0_351
That's all. With these 2 simple steps, I was able to fix the problem.
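As a quick sanity check, here is a minimal Python sketch that confirms JAVA_HOME is actually visible to the process that launches Spark; the local[2] master and app name are just placeholders:
import os
from pyspark.sql import SparkSession

# Fail fast with a clear message if JAVA_HOME is missing or points nowhere.
java_home = os.environ.get("JAVA_HOME")
assert java_home and os.path.isdir(java_home), f"JAVA_HOME is not set correctly: {java_home!r}"

# If the gateway starts, the Java setup is fine.
spark = SparkSession.builder.master("local[2]").appName("smoke-test").getOrCreate()
print(spark.version)
spark.stop()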
Upvotes: 1
Reputation: 201
I had the same error with PySpark, and setting JAVA_HOME to Java 11 worked for me (it was originally set to 16). I'm using macOS and PyCharm.
You can check your current Java version by doing echo $JAVA_HOME.
Below is what worked for me. On my Mac I used the following Homebrew command, but you can use a different method to install the desired Java version, depending on your OS.
# Install Java 11 (I believe 8 works too)
brew install openjdk@11
# Set JAVA_HOME by assigning the path where your Java is
export JAVA_HOME=/usr/local/opt/openjdk@11
Note: If you installed using Homebrew and need to find the location of the path, you can do brew --prefix openjdk@11
and it should return a path like this: /usr/local/opt/openjdk@11
At this point, I could run my PySpark program from the terminal - however, my IDE (PyCharm) still had the same error until I globally changed the JAVA_HOME variable.
To update the variable, first check whether you're using the Z shell (executable zsh) or the Bash shell by running echo $SHELL on the command line. For Z shell you'll edit the $HOME/.zshenv file, and for Bash you'll edit the $HOME/.bash_profile file.
# Open the file
vim ~/.zshenv
# Or
vim ~/.bash_profile
# Once inside the file, set the variable with your Java path, then save and close the file
export JAVA_HOME=/usr/local/opt/openjdk@11
# Test if it was set successfully
echo $JAVA_HOME
Output:
/usr/local/opt/openjdk@11
After this step, I could run PySpark through my PyCharm IDE as well.
Upvotes: 10
Reputation: 635
After spending a good amount of time with this issue, I was able to solve this. I own macOS v10.15 (Catalina), working on PyCharm in an Anaconda environment.
Spark currently supports only Java 8. If you install Java through the command line, it will by default install the latest Java (10 or later), which will cause all sorts of trouble. To solve this, follow the steps below:
1. Make sure you have Homebrew, else install Homebrew
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
2. Install the Xcode command-line tools
xcode-select --install
3. Install Java8 through the official website (not through terminal)
https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html
4. Install Apache-Spark
brew install apache-spark
5. Install Pyspark and Findspark (if you have anaconda)
conda install -c conda-forge findspark
conda install -c conda-forge/label/gcc7 findspark
conda install -c conda-forge pyspark
Voila! This should let you run PySpark without any issues.
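For completeness, a minimal sketch of how findspark is then used in a script or notebook; it locates Spark automatically if SPARK_HOME is set, and passing a path explicitly is optional:
import findspark

# Locates Spark via SPARK_HOME (or common install paths) and adds
# pyspark/py4j to sys.path before the first pyspark import.
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("findspark-demo").getOrCreate()
print(spark.version)
spark.stop()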
Upvotes: 3
Reputation: 561
Step 1:
Check the Java version from the terminal.
java -version
If you see bash: java: command not found, it means you don't have Java installed on your system.
Step 2:
Install Java using the following command,
sudo apt-get install default-jdk
Step 3:
Now check the Java version again; you'll see that it has been installed.
java -version
Result:
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)
Step 4:
Now run the PySpark code, and you'll never see such an error again.
Upvotes: 1
Reputation: 347
You can simply run the following code in the terminal.
sudo apt-get install default-jdk
Upvotes: 0
Reputation: 3831
I had the same issue once when I brought Spark up in a Docker container. It turned out I had set the wrong permissions on the /tmp folder.
If Spark doesn't have write permission on /tmp, it will cause this issue too.
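A minimal sketch to verify this from inside the container before creating the SparkContext, assuming the default scratch directory /tmp (spark.local.dir may point elsewhere in your setup):
import os
import tempfile

scratch_dir = "/tmp"  # default value of spark.local.dir

# Cheap permission check for the user running the driver.
print("writable:", os.access(scratch_dir, os.W_OK))

# Actually creating a file also catches read-only mounts.
try:
    with tempfile.NamedTemporaryFile(dir=scratch_dir) as f:
        print("created and removed", f.name)
except OSError as e:
    print("cannot write to", scratch_dir, "->", e)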
Upvotes: 0
Reputation: 133
I ran into this problem and it was actually not due to the JAVA_HOME setting. I assume you are using Windows and Anaconda as your Python tooling. Please check whether you can use the command prompt: I could not run Spark because cmd itself was crashing. After fixing this, Spark worked well on my PC.
Upvotes: 1
Reputation: 121
If you are using Jupyter Notebook from a Windows machine.
Just use the following code:
spark = SparkSession.builder.appName('myapp').getOrCreate
Don't use it like:
spark = SparkSession.builder.appName('myapp').getOrCreate()
Upvotes: -5
Reputation: 2822
This error indicates a mismatch between your PySpark and Java versions; the two are not compatible. See the compatibility matrix below.
PySpark Version     Min Java Version
------------------------------------
2.0.x - 2.2.x       Java 7
2.3.x - 2.4.x       Java 8
3.0.x - 3.1.x       Java 8
3.2.x               Java 11
Check your Java version. If it is 17, then you need at least PySpark 3.3, so upgrade PySpark.
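A small sketch to check both sides of the matrix from Python (note that java -version writes its banner to stderr):
import subprocess
import pyspark

# Version of the installed PySpark package.
print("PySpark:", pyspark.__version__)

# `java -version` prints on stderr, so capture both streams.
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr.strip() or result.stdout.strip())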
Upvotes: 1
Reputation: 22356
The error "Java gateway process exited before sending the driver its port number" occurs in SPARK_HOME/python/lib/pyspark.zip/pyspark/java_gateway.py.
if not on_windows:
# Don't send Ctrl + C / SIGINT to the Java gateway:
def preexec_func():
signal.signal(signal.SIGINT, signal.SIG_IGN)
popen_kwargs["preexec_fn"] = preexec_func
proc = Popen(command, **popen_kwargs)
else:
# preexec_fn is not supported on Windows
proc = Popen(command, **popen_kwargs)
# Wait for the file to appear, or for the process
# to exit, whichever happens first.
while not proc.poll() and not os.path.isfile(conn_info_file):
time.sleep(0.1)
if not os.path.isfile(conn_info_file):
raise RuntimeError("Java gateway process exited before sending its port number") # <-----
The code below simulates this PySpark process invocation to test whether PySpark can be started.
Run it to make sure PySpark is invoked. This is specific to Spark installed with Homebrew on Apple silicon, but the idea and approach are applicable to other platforms. The Spark version here is 3.3.1; change the versions accordingly.
import os
import shutil
import signal
import sys
import tempfile
import time
from subprocess import Popen, PIPE
# --------------------------------------------------------------------------------
# Constant
# --------------------------------------------------------------------------------
SPARK_HOME = "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec"
JAVA_HOME = '/opt/homebrew/opt/openjdk'
# --------------------------------------------------------------------------------
# Environment Variables
# NOTE:
# SPARK_HOME must be set to /opt/homebrew/Cellar/apache-spark/3.3.1/libexec",
# NOT /opt/homebrew/Cellar/apache-spark/3.3.1".
# Otherwise Java gateway process exited before sending its port number in java_gateway.py
# --------------------------------------------------------------------------------
os.environ['SPARK_HOME'] = SPARK_HOME
os.environ['JAVA_HOME'] = JAVA_HOME
sys.path.extend([
f"{SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip",
f"{SPARK_HOME}/python/lib/pyspark.zip",
])
# --------------------------------------------------------------------------------
# PySpark Modules
# --------------------------------------------------------------------------------
from pyspark.serializers import read_int, UTF8Deserializer
# --------------------------------------------------------------------------------
#
# --------------------------------------------------------------------------------
def preexec_func():
signal.signal(signal.SIGINT, signal.SIG_IGN)
def run_pyspark():
"""
"""
pyspark_command = [f'{SPARK_HOME}/bin/spark-submit', '--master', 'local[*]', 'pyspark-shell']
# Create a temporary directory where the gateway server should write the connection
# information.
proc = None
conn_info_dir = tempfile.mkdtemp()
try:
fd, conn_info_file = tempfile.mkstemp(dir=conn_info_dir)
os.close(fd)
os.unlink(conn_info_file)
env = dict(os.environ)
env["_PYSPARK_DRIVER_CONN_INFO_PATH"] = conn_info_file
# Launch the Java gateway.
popen_kwargs = {"stdin": PIPE, "env": env}
# We open a pipe to standard input, so that the Java gateway can die when the pipe is broken
# We always set the necessary environment variables.
print(f"\nrun pyspark command line {pyspark_command}")
popen_kwargs["preexec_fn"] = preexec_func
proc = Popen(pyspark_command, **popen_kwargs)
# Wait for the file to appear, or for the process to exit, whichever happens first.
count: int = 5
while not proc.poll() and not os.path.isfile(conn_info_file):
print("waiting for PySpark to start...")
count -= 1
if count < 0:
break
time.sleep(1)
if not os.path.isfile(conn_info_file):
raise RuntimeError("Java gateway process exited before sending its port number")
with open(conn_info_file, "rb") as info:
gateway_port = read_int(info)
gateway_secret = UTF8Deserializer().loads(info)
out, err = proc.communicate()
print("-"*80)
print(f"PySpark started with pid {proc.pid}")
print(f"spark process port {gateway_port}")
finally:
shutil.rmtree(conn_info_dir)
if proc:
proc.kill()
def test():
run_pyspark()
if __name__ == "__main__":
test()
JDK 8, Scala, and Spark have been installed with brew.
brew install --cask adoptopenjdk8
brew install scala
brew install apache-spark
The Java process must be allowed to accept incoming connections under System Preferences → Security → Firewall.
Ensure the SPARK_HOME environment variable points to the directory where the Spark distribution has been extracted. Update the PYTHONPATH environment variable so that Python can find PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:
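(A sketch of the idea; the extraction path and the py4j version in the zip name are assumptions, so adjust them to your own download.)
import os
import sys

# Assumed extraction path and py4j version; adjust to your distribution.
SPARK_HOME = "/opt/spark-3.3.1-bin-hadoop3"
os.environ["SPARK_HOME"] = SPARK_HOME

# Equivalent of adding these entries to PYTHONPATH.
sys.path.extend([
    f"{SPARK_HOME}/python",
    f"{SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip",
])

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)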
Upvotes: 1
Reputation: 169
There are many valuable hints here; however, none solved my problem completely, so I will show the procedure that worked for me in an Anaconda Jupyter Notebook on Windows:
1. Run where conda and where python and add the paths of the .exe files' directories to your Path variable using the Windows environment variables tool. Also add the variables JAVA_HOME and SPARK_HOME there with their corresponding paths.
2. In the notebook, set SPARK_HOME, PYSPARK_SUBMIT_ARGS and JAVA_HOME (use your own paths for SPARK_HOME and JAVA_HOME):
import os
os.environ["SPARK_HOME"] = r"C:\Spark\spark-3.2.0-bin-hadoop3.2"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[3] pyspark-shell"
os.environ["JAVA_HOME"] = r"C:\Java\jre1.8.0_311"
3. Install findspark from the notebook with !pip install findspark.
4. Run import findspark and findspark.init()
5. Run from pyspark.sql import SparkSession and spark = SparkSession.builder.getOrCreate()
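Put together, the notebook cell looks roughly like this (the paths are the placeholders from step 2; replace them with your own):
import os

# Placeholder paths from the steps above; point them at your own installs.
os.environ["SPARK_HOME"] = r"C:\Spark\spark-3.2.0-bin-hadoop3.2"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[3] pyspark-shell"
os.environ["JAVA_HOME"] = r"C:\Java\jre1.8.0_311"

import findspark
findspark.init()  # picks up SPARK_HOME set above

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)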
Some useful links:
https://towardsdatascience.com/installing-apache-pyspark-on-windows-10-f5f0c506bea1
https://www.datacamp.com/community/tutorials/installing-anaconda-windows
Upvotes: 6
Reputation: 319
The error usually occurs when your system doesn't have java installed.
Check whether you have Java installed by opening the terminal and running
java --version
It's always advisable to use Homebrew (brew install) for installing packages. To install Java:
brew install openjdk@11
Now that you have java installed, set the path globally depending on the shell you use: Z shell or bash.
export JAVA_HOME=/usr/local/opt/openjdk@11
Upvotes: 0
Reputation: 3117
I will repost how I solved it here, just for future reference.
How I solved my similar problem:
Prerequisite:
Steps I did (note: set the folder paths according to your system):
- set the following environment variables.
- SPARK_HOME to 'C:\spark\spark-3.0.1-bin-hadoop2.7'
- set HADOOP_HOME to 'C:\spark\spark-3.0.1-bin-hadoop2.7'
- set PYSPARK_DRIVER_PYTHON to 'jupyter'
- set PYSPARK_DRIVER_PYTHON_OPTS to 'notebook'
- add 'C:\spark\spark-3.0.1-bin-hadoop2.7\bin;' to PATH system variable.
- Change the Java install folder to be directly under C: (previously Java was installed under Program Files, so I re-installed it directly under C:)
- so my JAVA_HOME becomes 'C:\java\jdk1.8.0_271'
Now it works!
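As a quick check that the variables are visible to Python and that none of the paths contain a space (the original 'Program Files' problem), a minimal sketch:
import os

# Variables from the steps above; HADOOP_HOME and SPARK_HOME point to the same folder here.
for name in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    value = os.environ.get(name, "<not set>")
    flag = " (contains a space!)" if " " in value else ""
    print(f"{name} = {value}{flag}")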
Upvotes: 10
Reputation: 306
I was getting this error when I was using the 32-bit JDK 1.8; switching to 64-bit worked for me.
I was getting the error because 32-bit Java cannot allocate more than about 3 GB of heap memory, while the Spark driver here was configured for 16 GB:
builder = SparkSession.builder \
.appName("Spark NLP") \
.master("local[*]") \
.config("spark.driver.memory", "16G") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.config("spark.kryoserializer.buffer.max", "1000M") \
.config("spark.driver.maxResultSize", "0")
I tested setting this to 2G and it worked in 32-bit as well.
Upvotes: 3
Reputation: 691
I had the same issue when trying to run a PySpark job triggered from Airflow with a remote spark.driver.host. The cause of the issue in my case was:
Exception: Java gateway process exited before sending the driver its port number
...
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
Fixed by adding the export:
export HADOOP_CONF_DIR=/etc/hadoop/conf
And the same environment variable added in the pyspark script:
import os
os.environ["HADOOP_CONF_DIR"] = '/etc/hadoop/conf'
Upvotes: 0
Reputation: 1309
This usually happens if you do not have Java installed on your machine.
Go to the command prompt and check your Java version by typing:
java -version
You should get output something like this:
java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)
If not, go to Oracle and download the JDK. Check this video on how to download Java and add it to the build path.
https://www.youtube.com/watch?v=f7rT0h1Q5Wo
Upvotes: 1
Reputation: 2833
The error occurred because Java is not installed on the machine. Spark is developed in Scala, which runs on the JVM.
Install Java and then execute the PySpark statements. It will work.
Upvotes: 1
Reputation: 29407
I had this error message running PySpark on Ubuntu and got rid of it by installing the openjdk-8-jdk package.
from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("MyApp").setMaster("local"))
^^^ error
Install Open JDK 8:
apt-get install openjdk-8-jdk-headless -qq
Same on Mac OS, I typed in a terminal:
$ java -version
No Java runtime present, requesting install.
I was prompted to install Java from Oracle's download site, chose the macOS installer, clicked on jdk-13.0.2_osx-x64_bin.dmg, and after that checked that Java was installed:
$ java -version
java version "13.0.2" 2020-01-14
EDIT: To install JDK 8 you need to go to https://www.oracle.com/java/technologies/javase-jdk8-downloads.html (login required).
After that I was able to start a Spark context with pyspark.
In Python:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
# check that it really works by running a job
# example from http://spark.apache.org/docs/latest/rdd-programming-guide.html#parallelized-collections
data = range(10000)
distData = sc.parallelize(data)
distData.filter(lambda x: not x&1).take(10)
# Out: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
Note that you might need to set the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON, and they have to be the same Python version as the Python (or IPython) you're using to run PySpark (the driver).
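A minimal sketch of doing that from the driver script itself, using sys.executable so both variables point at the interpreter that is already running (set them before the context is created):
import os
import sys

# Use the interpreter running this script for both the driver and the workers,
# so their Python versions cannot diverge.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
print(sc.pythonVer)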
Upvotes: 30
Reputation: 119
Spark is very picky about the Java version you use. It is highly recommended that you use Java 1.8 (the open-source AdoptOpenJDK 8 works well too).
After installing it, set JAVA_HOME in your bash variables if you use Mac/Linux:
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
export PATH=$JAVA_HOME/bin:$PATH
Upvotes: 5
Reputation: 379
I got this error fixed by using the code below. I had already set up SPARK_HOME, though. You may follow the simple steps from the eproblems website.
spark_home = os.environ.get('SPARK_HOME', None)
Upvotes: 0
Reputation: 91
There are many possible reasons for this error. Mine was that the version of PySpark was incompatible with Spark: PySpark was 2.4.0, but Spark was 2.2.0. This causes Python to fail when starting the Spark process, so Spark cannot tell its port to Python, and the error is "Pyspark: Exception: Java gateway process exited before sending the driver its port number".
I suggest you dive into the source code to find out the real reason when this error happens.
Upvotes: 0
Reputation: 1179
For Linux (Ubuntu 18.04) with a JAVA_HOME issue, the key is to point it to the master folder:
1. Check and select the installed Java with sudo update-alternatives --config java. If Java 8 is not installed, install it with sudo apt install openjdk-8-jdk.
2. Set the JAVA_HOME environment variable to the master Java 8 folder. The location is given by the first command above with jre/bin/java removed, namely: export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/". If done on the command line, this is relevant only for the current session (ref: export command on Linux). To verify: echo $JAVA_HOME.
3. To set it permanently, add the export line above to your .bashrc. This file loads when a bash shell is started interactively (ref: .bashrc).
Upvotes: 2
Reputation: 41
I had the same exception and tried everything by setting and resetting all environment variables. But in the end the issue drilled down to a space in the appName property of the Spark session, that is, SparkSession.builder.appName("StreamingDemo").getOrCreate(). Immediately after removing the space from the string given to the appName property, it was resolved. I was using pyspark 2.7 with Eclipse on a Windows 10 environment. It worked for me.
Upvotes: 1
Reputation: 1263
In my case it was because I wrote SPARK_DRIVER_MEMORY=10 instead of SPARK_DRIVER_MEMORY=10g in spark-env.sh.
Upvotes: 1
Reputation: 356
I use Mac OS. I fixed the problem!
Below is how I fixed it.
JDK 8 seems to work fine. (https://github.com/jupyter/jupyter/issues/248)
So I checked my JDKs in /Library/Java/JavaVirtualMachines; I only had jdk-11.jdk in this path.
I downloaded JDK 8 (I followed the link), which is:
brew tap caskroom/versions
brew cask install java8
After this, I added
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home
export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"
to the ~/.bash_profile file (you should check your jdk1.8 folder name).
It works now! Hope this helps :)
Upvotes: 12
Reputation: 1650
If you are trying to run Spark without the Hadoop binaries, you might encounter the above-mentioned error. One solution is to:
1) download Hadoop separately;
2) add Hadoop to your PATH;
3) add the Hadoop classpath to your Spark install.
The first two steps are trivial; the last step is best done by adding the following to $SPARK_HOME/conf/spark-env.sh on each Spark node (master and workers):
### in conf/spark-env.sh ###
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
for more info also check: https://spark.apache.org/docs/latest/hadoop-provided.html
Upvotes: 3
Reputation: 2133
Make sure that both your Java directory (as found in your path) AND your Python interpreter reside in directories with no spaces in them. These were the cause of my problem.
Upvotes: 0
Reputation: 400
This is an old thread, but I'm adding my solution for those who use a Mac.
The issue was with JAVA_HOME. You have to include this in your .bash_profile.
Check your java -version. If you downloaded the latest Java but it doesn't show up as the latest version, then you know that the path is wrong. Normally, the default path is export JAVA_HOME=/usr/bin/java.
So try changing the path to:
/Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java
Alternatively you could also download the latest JDK.
https://www.oracle.com/technetwork/java/javase/downloads/index.html and this will automatically replace usr/bin/java with the latest version. You can confirm this by doing java -version again.
Then that should work.
Upvotes: 0
Reputation: 1
For me, the answer was to add two 'Content Roots' in 'File' -> 'Project Structure' -> 'Modules' (in IntelliJ).
Upvotes: 0
Reputation: 63
I had the same error when running PySpark in PyCharm. I solved the problem by adding JAVA_HOME to PyCharm's environment variables.
Upvotes: 1