Reputation: 660
Context: the cluster configuration is as follows: <cluster configuration listing omitted>
Problem: I'm trying to create a Spark session from a PySpark Jupyter notebook using cluster deploy mode, so that the driver runs on a node other than the one the Jupyter notebook runs on. Right now I can run jobs on the cluster, but only with the driver running on node2.
After a lot of digging I found this Stack Overflow post, which claims that an interactive Spark shell can only run in client deploy mode (where the driver is located on the machine you are working on). That post goes on to say that, as a result, something like JupyterHub also won't work in cluster deploy mode, but I am unable to find any documentation that confirms this. Can someone confirm whether JupyterHub can run in cluster mode at all?
My attempt at creating the Spark session in cluster deploy mode:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .enableHiveSupport() \
    .config("spark.local.ip", <node 3 ip>) \
    .config("spark.driver.host", <node 3 ip>) \
    .config("spark.submit.deployMode", "cluster") \
    .getOrCreate()
Error:
/usr/spark/python/pyspark/sql/session.py in getOrCreate(self)
167 for key, value in self._options.items():
168 sparkConf.set(key, value)
--> 169 sc = SparkContext.getOrCreate(sparkConf)
170 # This SparkContext may be an existing one.
171 for key, value in self._options.items():
/usr/spark/python/pyspark/context.py in getOrCreate(cls, conf)
308 with SparkContext._lock:
309 if SparkContext._active_spark_context is None:
--> 310 SparkContext(conf=conf or SparkConf())
311 return SparkContext._active_spark_context
312
/usr/spark/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
113 """
114 self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
116 try:
117 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/usr/spark/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
257 with SparkContext._lock:
258 if not SparkContext._gateway:
--> 259 SparkContext._gateway = gateway or launch_gateway(conf)
260 SparkContext._jvm = SparkContext._gateway.jvm
261
/usr/spark/python/pyspark/java_gateway.py in launch_gateway(conf)
93 callback_socket.close()
94 if gateway_port is None:
---> 95 raise Exception("Java gateway process exited before sending the driver its port number")
96
97 # In Windows, ensure the Java child processes do not linger after Python has exited.
Exception: Java gateway process exited before sending the driver its port number
Upvotes: 8
Views: 15368
Reputation: 14939
One way to get the driver onto the cluster from Jupyter is through Apache Livy. Basically, Livy is a REST API service for a Spark cluster. Jupyter has an extension, sparkmagic, that integrates Livy with Jupyter; with sparkmagic bound, the driver runs in the YARN cluster and not locally.
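For illustration, a minimal sketch of the Livy REST API that sparkmagic drives under the hood, using Python and the requests library; the Livy host is a placeholder and 8998 is Livy's default port:

import json
import time
import requests

livy_url = "http://<livy host>:8998"          # 8998 is Livy's default port
headers = {"Content-Type": "application/json"}

# Ask Livy to start a PySpark session; the driver is launched on the cluster
# (e.g. in YARN), not in the notebook's Python process.
resp = requests.post(livy_url + "/sessions",
                     data=json.dumps({"kind": "pyspark"}),
                     headers=headers)
session_id = resp.json()["id"]

# Poll until the session is ready to accept statements.
while requests.get("%s/sessions/%d" % (livy_url, session_id),
                   headers=headers).json()["state"] != "idle":
    time.sleep(5)

# Submit a statement; it executes in the remote driver.
stmt = requests.post("%s/sessions/%d/statements" % (livy_url, session_id),
                     data=json.dumps({"code": "spark.range(100).count()"}),
                     headers=headers)
print(stmt.json())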
Another way to use YARN cluster mode in Jupyter is to use Jupyter Enterprise Gateway:
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/kernel-yarn-cluster-mode.html#configuring-kernels-for-yarn-cluster-mode
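For reference, the kernelspecs described in those docs delegate the kernel launch to a YARN cluster process proxy. A trimmed sketch of such a kernel.json, assuming SPARK_HOME at /usr/spark and omitting the argv launcher entry a real Enterprise Gateway kernelspec also needs:

{
  "display_name": "Spark - Python (YARN Cluster Mode)",
  "language": "python",
  "metadata": {
    "process_proxy": {
      "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
    }
  },
  "env": {
    "SPARK_HOME": "/usr/spark",
    "SPARK_OPTS": "--master yarn --deploy-mode cluster"
  }
}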
There are also commercial options that normally use one of the methods listed above. Some of our users, for example, are on IBM DSX (aka IBM Watson Studio Local), which uses Apache Livy, option number one above.
Upvotes: 4
Reputation: 35249
You cannot use cluster deploy mode with PySpark on a standalone cluster at all; per the Spark documentation:
Currently, standalone mode does not support cluster mode for Python applications.
and even if you could, cluster deploy mode is not applicable in an interactive environment; Spark's submit code rejects it explicitly:
case (_, CLUSTER) if isShell(args.primaryResource) =>
  error("Cluster deploy mode is not applicable to Spark shells.")
case (_, CLUSTER) if isSqlShell(args.mainClass) =>
  error("Cluster deploy mode is not applicable to Spark SQL shell.")
Upvotes: 7
Reputation: 1553
I'm not an expert in PySpark, but did you try changing the kernel.json file of the PySpark Jupyter kernel?
Maybe you can add the cluster deploy mode option there:
"env": {
"SPARK_HOME": "/your_dir/spark",
"PYTHONPATH": "/your_dir/spark/python:/your_dir/spark/python/lib/py4j-0.9-src.zip",
"PYTHONSTARTUP": "/your_dir/spark/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"
}
Then change this line:
"PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"
to use your cluster master URL and --deploy-mode cluster.
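For illustration, assuming a standalone master reachable at <master ip> (a placeholder; 7077 is the default master port), the changed line would look something like:

"PYSPARK_SUBMIT_ARGS": "--master spark://<master ip>:7077 --deploy-mode cluster pyspark-shell"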
Not sure that will change anything, but maybe it works, and I would love to know too!
Good luck.
EDIT: I found this, maybe it can help you even though it's from 2015.
Upvotes: 0