Reputation: 343
So, I am trying to create a Spark session in Python 2.7 using the following:
#Initialize SparkSession and SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkContext
#Create a Spark Session
SpSession = SparkSession \
    .builder \
    .master("local[2]") \
    .appName("V2 Maestros") \
    .config("spark.executor.memory", "1g") \
    .config("spark.cores.max", "2") \
    .config("spark.sql.warehouse.dir", "file:///c:/temp/spark-warehouse") \
    .getOrCreate()
#Get the Spark Context from Spark Session
SpContext = SpSession.sparkContext
I get the following error, pointing to python\lib\pyspark.zip\pyspark\java_gateway.py:
Exception: Java gateway process exited before sending the driver its port number
I tried looking into the java_gateway.py file, which has the following contents:
import atexit
import os
import sys
import select
import signal
import shlex
import socket
import platform
from subprocess import Popen, PIPE

if sys.version >= '3':
    xrange = range

from py4j.java_gateway import java_import, JavaGateway, GatewayClient
from py4j.java_collections import ListConverter

from pyspark.serializers import read_int


# patching ListConverter, or it will convert bytearray into Java ArrayList
def can_convert_list(self, obj):
    return isinstance(obj, (list, tuple, xrange))

ListConverter.can_convert = can_convert_list


def launch_gateway():
    if "PYSPARK_GATEWAY_PORT" in os.environ:
        gateway_port = int(os.environ["PYSPARK_GATEWAY_PORT"])
    else:
        SPARK_HOME = os.environ["SPARK_HOME"]
        # Launch the Py4j gateway using Spark's run command so that we pick up the
        # proper classpath and settings from spark-env.sh
        on_windows = platform.system() == "Windows"
        script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit"
        submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
        if os.environ.get("SPARK_TESTING"):
            submit_args = ' '.join([
                "--conf spark.ui.enabled=false",
                submit_args
            ])
        command = [os.path.join(SPARK_HOME, script)] + shlex.split(submit_args)

        # Start a socket that will be used by PythonGatewayServer to communicate its port to us
        callback_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        callback_socket.bind(('127.0.0.1', 0))
        callback_socket.listen(1)
        callback_host, callback_port = callback_socket.getsockname()
        env = dict(os.environ)
        env['_PYSPARK_DRIVER_CALLBACK_HOST'] = callback_host
        env['_PYSPARK_DRIVER_CALLBACK_PORT'] = str(callback_port)

        # Launch the Java gateway.
        # We open a pipe to stdin so that the Java gateway can die when the pipe is broken
        if not on_windows:
            # Don't send ctrl-c / SIGINT to the Java gateway:
            def preexec_func():
                signal.signal(signal.SIGINT, signal.SIG_IGN)
            proc = Popen(command, stdin=PIPE, preexec_fn=preexec_func, env=env)
        else:
            # preexec_fn not supported on Windows
            proc = Popen(command, stdin=PIPE, env=env)

        gateway_port = None
        # We use select() here in order to avoid blocking indefinitely if the subprocess dies
        # before connecting
        while gateway_port is None and proc.poll() is None:
            timeout = 1  # (seconds)
            readable, _, _ = select.select([callback_socket], [], [], timeout)
            if callback_socket in readable:
                gateway_connection = callback_socket.accept()[0]
                # Determine which ephemeral port the server started on:
                gateway_port = read_int(gateway_connection.makefile(mode="rb"))
                gateway_connection.close()
                callback_socket.close()
        if gateway_port is None:
            raise Exception("Java gateway process exited before sending the driver its port number")

        # In Windows, ensure the Java child processes do not linger after Python has exited.
        # In UNIX-based systems, the child process can kill itself on broken pipe (i.e. when
        # the parent process' stdin sends an EOF). In Windows, however, this is not possible
        # because java.lang.Process reads directly from the parent process' stdin, contending
        # with any opportunity to read an EOF from the parent. Note that this is only best
        # effort and will not take effect if the python process is violently terminated.
        if on_windows:
            # In Windows, the child process here is "spark-submit.cmd", not the JVM itself
            # (because the UNIX "exec" command is not available). This means we cannot simply
            # call proc.kill(), which kills only the "spark-submit.cmd" process but not the
            # JVMs. Instead, we use "taskkill" with the tree-kill option "/t" to terminate all
            # child processes in the tree (http://technet.microsoft.com/en-us/library/bb491009.aspx)
            def killChild():
                Popen(["cmd", "/c", "taskkill", "/f", "/t", "/pid", str(proc.pid)])
            atexit.register(killChild)

    # Connect to the gateway
    gateway = JavaGateway(GatewayClient(port=gateway_port), auto_convert=True)

    # Import the classes used by PySpark
    java_import(gateway.jvm, "org.apache.spark.SparkConf")
    java_import(gateway.jvm, "org.apache.spark.api.java.*")
    java_import(gateway.jvm, "org.apache.spark.api.python.*")
    java_import(gateway.jvm, "org.apache.spark.ml.python.*")
    java_import(gateway.jvm, "org.apache.spark.mllib.api.python.*")
    # TODO(davies): move into sql
    java_import(gateway.jvm, "org.apache.spark.sql.*")
    java_import(gateway.jvm, "org.apache.spark.sql.hive.*")
    java_import(gateway.jvm, "scala.Tuple2")

    return gateway
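For what it is worth, launch_gateway() above just runs spark-submit (spark-submit.cmd on Windows) from SPARK_HOME and waits for a port on a callback socket. The small sketch below only mirrors what that code computes (the Windows case is assumed here), to confirm that the script it will try to launch actually exists:
import os
import shlex

SPARK_HOME = os.environ["SPARK_HOME"]
script = os.path.join(SPARK_HOME, "bin", "spark-submit.cmd")
submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")

# If this prints False, the gateway cannot be launched from this SPARK_HOME
print(script + " exists: " + str(os.path.exists(script)))
print("PYSPARK_SUBMIT_ARGS: " + " ".join(shlex.split(submit_args)))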
I am pretty new to Spark and PySpark, hence unable to debug the issue here. I also looked at some other suggestions: Spark + Python - Java gateway process exited before sending the driver its port number? and Pyspark: Exception: Java gateway process exited before sending the driver its port number,
but have been unable to resolve this so far. Please help!
Here is what the Spark environment script looks like:
# This script loads spark-env.sh if it exists, and ensures it is only loaded once.
# spark-env.sh is loaded from SPARK_CONF_DIR if set, or within the current directory's
# conf/ subdirectory.

# Figure out where Spark is installed
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

if [ -z "$SPARK_ENV_LOADED" ]; then
  export SPARK_ENV_LOADED=1

  # Returns the parent of the directory this script lives in.
  parent_dir="${SPARK_HOME}"

  user_conf_dir="${SPARK_CONF_DIR:-"$parent_dir"/conf}"

  if [ -f "${user_conf_dir}/spark-env.sh" ]; then
    # Promote all variable declarations to environment (exported) variables
    set -a
    . "${user_conf_dir}/spark-env.sh"
    set +a
  fi
fi

# Setting SPARK_SCALA_VERSION if not already set.
if [ -z "$SPARK_SCALA_VERSION" ]; then
  ASSEMBLY_DIR2="${SPARK_HOME}/assembly/target/scala-2.11"
  ASSEMBLY_DIR1="${SPARK_HOME}/assembly/target/scala-2.10"

  if [[ -d "$ASSEMBLY_DIR2" && -d "$ASSEMBLY_DIR1" ]]; then
    echo -e "Presence of build for both scala versions(SCALA 2.10 and SCALA 2.11) detected." 1>&2
    echo -e 'Either clean one of them or, export SPARK_SCALA_VERSION=2.11 in spark-env.sh.' 1>&2
    exit 1
  fi

  if [ -d "$ASSEMBLY_DIR2" ]; then
    export SPARK_SCALA_VERSION="2.11"
  else
    export SPARK_SCALA_VERSION="2.10"
  fi
fi
Here is how my Spark environment is set up in Python:
import os
import sys

# NOTE: Please change the folder paths to your current setup.
#Windows
if sys.platform.startswith('win'):
    #Where you downloaded the resource bundle
    os.chdir("E:/Udemy - Spark/SparkPythonDoBigDataAnalytics-Resources")
    #Where you installed spark.
    os.environ['SPARK_HOME'] = 'E:/Udemy - Spark/Apache Spark/spark-2.0.0-bin-hadoop2.7'
#other platforms - linux/mac
else:
    os.chdir("/Users/kponnambalam/Dropbox/V2Maestros/Modules/Apache Spark/Python")
    os.environ['SPARK_HOME'] = '/users/kponnambalam/products/spark-2.0.0-bin-hadoop2.7'

os.curdir
# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']
#Add the following paths to the system path. Please check your installation
#to make sure that these zip files actually exist. The names might change
#as versions change.
sys.path.insert(0,os.path.join(SPARK_HOME,"python"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","pyspark.zip"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","py4j-0.10.1-src.zip"))
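# As the comment above notes, the zip names change between versions; a quick
# sanity check (a sketch reusing the SPARK_HOME variable set above) is to
# verify those files actually exist before importing pyspark:
for p in [os.path.join(SPARK_HOME, "python", "lib", "pyspark.zip"),
          os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.1-src.zip")]:
    print(p + " exists: " + str(os.path.exists(p)))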
#Initialize SparkSession and SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkContext
Upvotes: 8
Views: 41696
Reputation: 1
This error message occurred for me while I was trying to run Spark code inside a Docker container.
I simply added Java installation steps to the Dockerfile, which resolved the issue.
The steps added:
# Install OpenJDK-11
RUN apt-get update && \
    apt-get install -y openjdk-11-jre-headless && \
    apt-get clean;
Upvotes: 0
Reputation: 619
Spark does not work well with Java versions newer than 11. Downgrade Java to version 8 or 11 and set JAVA_HOME accordingly.
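A minimal sketch of pointing the driver at a specific JDK before the session is created; the install path below is purely an assumption, substitute your own Java 8 or 11 location:
import os

# Assumption: adjust this path to wherever your Java 8 or 11 install actually lives
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_141"
os.environ["PATH"] = os.path.join(os.environ["JAVA_HOME"], "bin") + os.pathsep + os.environ["PATH"]

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("JavaHomeCheck").getOrCreate()
print(spark.version)
spark.stop()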
Upvotes: -1
Reputation: 133
I had the same problem.
Luckily, I found the reason.
from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName('Check Pyspark').master("local").getOrCreate()
spark = SparkSession.builder.appName('CheckPyspark').master("local").getOrCreate()
print spark.sparkContext.parallelize(range(6), 3).collect()
Notice the difference between the second line and the third line.
If the appName argument contains a space, as in 'Check Pyspark', you will get the error (Exception: Java gateway process...).
The appName argument cannot contain a blank space, so change 'Check Pyspark' to 'CheckPyspark'.
Upvotes: 1
Reputation: 629
After reading many posts, I finally made Spark work on my Windows laptop. I use Anaconda Python, but I am sure this will work with the standard distribution too.
So, you need to make sure you can run Spark independently. My assumptions are that you have a valid Python path and Java installed. For Java, I had "C:\ProgramData\Oracle\Java\javapath" defined in my Path, which redirects to my Java 8 bin folder.
Go to %SPARK_HOME%\bin and try to run pyspark, which is the Python Spark shell. If your environment is like mine, you will see an exception about the inability to find winutils and hadoop. The second exception will be about the missing Hive:
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
I then found and simply followed https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-tips-and-tricks-running-spark-windows.html Specifically:
winutils.exe chmod -R 777 C:\tmp\hive
run in cmd in admin mode. Hope that helps, and you can enjoy running Spark code locally.
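For reference, a minimal sketch of doing the same setup from Python; the C:\hadoop location for winutils.exe is an assumption, adjust it to wherever you unpacked it:
import os
import subprocess

# Assumptions: winutils.exe was unpacked to C:\hadoop\bin and the Hive scratch dir is C:\tmp\hive
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = r"C:\hadoop\bin" + os.pathsep + os.environ["PATH"]

# Same chmod step as above (the Python process still needs sufficient privileges to run it)
subprocess.call([r"C:\hadoop\bin\winutils.exe", "chmod", "-R", "777", r"C:\tmp\hive"])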
Upvotes: 6
Reputation: 11
I had the exact same issue after playing around with my JAVA_HOME system environment variables on Windows 10 using Python 2.7: I tried to run the same configuration script for PySpark (based on the V2-Maestros Udemy course) and got the same error message, "Java gateway process exited before sending the driver its port number".
After several attempts to fix the problem, the only solution that worked was to uninstall all versions of Java (I had three of them), delete the JAVA_HOME system variable and the JAVA_HOME-related entry in the PATH variable, perform a clean installation of Java JRE 1.8.0_141, reconfigure both the JAVA_HOME and PATH entries in the Windows system environment, and restart my machine. After that, the script finally worked.
Hope this helps.
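A quick diagnostic sketch (not part of the fix itself) to confirm which Java installation the driver will actually pick up after the reinstall:
import os
import subprocess

# Show what JAVA_HOME points to; an empty string means it is not set
print("JAVA_HOME = " + os.environ.get("JAVA_HOME", ""))

# Show which java binary is first on PATH and its version
# (java -version writes to stderr, which still shows up in the console)
subprocess.call(["java", "-version"])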
Upvotes: 1
Reputation: 2392
From my "guess" this is a problem with your java version. Maybe you have two different java version installed. Also it looks like you are using code that you copy and paste from somewhere for setting the SPARK_HOME
etc.. There are many simple examples how to set up Spark. Also it looks like that you are using Windows. I would suggest to take a *NIX environment to test things as this is much easier e.g. you could use brew
to install Spark. Windows is not really made for this...
Upvotes: 1