I am building a Spark session (running on Apache Spark version 2.4.3) from a Jupyter notebook as follows:
from pyspark.sql import SparkSession

spark_session = SparkSession.builder \
    .master("yarn-client") \
    .enableHiveSupport() \
    .getOrCreate()

spark_session.conf.set("spark.executor.memory", '8g')
spark_session.conf.set('spark.executor.cores', '3')
spark_session.conf.set('spark.cores.max', '3')
spark_session.conf.set("spark.driver.memory", '8g')
sc = spark_session.sparkContext
I can see from the application master that all the parameters are being set properly except spark.driver.memory: no matter what I set it to, the driver uses only 1 GB.
I have checked spark-defaults.conf, but there is no entry for spark.driver.memory there. To check whether the issue lies with the session builder/Jupyter, I ran an application using spark-submit from the command line, and to my surprise it picks up whatever driver memory I pass.
Can someone please shed some light on this? What could be the reason that only spark.driver.memory is not picked up from Jupyter?
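For reference, the mismatch is also visible from inside the notebook. A quick check against the session built above (the second call falls back to '1g', Spark's documented default, when no value was set at launch):

# the session conf stores the value set after startup...
print(spark_session.conf.get("spark.driver.memory"))    # '8g'
# ...but the driver JVM was launched before that, so its conf still shows the launch-time value
print(sc.getConf().get("spark.driver.memory", "1g"))    # '1g'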
Upvotes: 1
Views: 1626
A Jupyter notebook launches PySpark in yarn-client mode, so the driver memory and some other configurations cannot be set through conf.set, because the driver JVM has already started. You must set them on the command line.
So, to your question: when you run Spark in client mode, setting a property via conf.set will not work for driver memory, because the driver JVM has already started at that point with the default configuration. That is why the property is picked up when you pass it from the command line.
A simple way to start pyspark is:
pyspark --driver-memory 2g --executor-memory 2g
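If you want Jupyter itself to come up with those flags, one common setup uses PySpark's standard driver environment variables (this assumes pyspark and jupyter are both on your PATH):

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --driver-memory 2g --executor-memory 2g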
Update:
To start Jupyter with custom PySpark arguments, create a custom kernel. More on getting started with Jupyter kernels: http://cleverowl.uk/2016/10/15/installing-jupyter-with-the-pyspark-and-r-kernels-for-spark-development/
and when you are defining "kernel.json", add --driver-memory 2g --executor-memory 2g to the PYSPARK_SUBMIT_ARGS option.
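For illustration, a minimal kernel.json could look like the sketch below; SPARK_HOME, PYTHONPATH and the py4j version are placeholders for your own installation, and the trailing pyspark-shell token is required in PYSPARK_SUBMIT_ARGS:

{
  "display_name": "PySpark",
  "language": "python",
  "argv": ["python", "-m", "ipykernel", "-f", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/usr/lib/spark",
    "PYTHONPATH": "/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --driver-memory 2g --executor-memory 2g pyspark-shell"
  }
}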
Upvotes: 3