Reputation: 51
I'm quite new to Big Data, and I'm currently working on a CLI project that performs some text parsing using Apache Spark.
When a command is typed, a new SparkContext is instantiated and some files are read from an HDFS instance. However, Spark takes too long to initialize a SparkContext, or even a SparkSession object.
So, my question is: is there a way to reuse a SparkContext instance between these commands to reduce this overhead? I've heard about Spark Job Server, but deploying a local server has been difficult since its main guide is a bit confusing.
Thank you.
P.S.: I'm using PySpark.
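
One common workaround, in case it helps frame the question: keep the driver process itself alive and read commands in a loop, so the SparkContext is created only once. A minimal PySpark sketch (the app name, command syntax, and HDFS path here are hypothetical):

```python
from pyspark.sql import SparkSession

# Build the session once; within this long-lived driver process,
# getOrCreate() returns the same session on every later call.
spark = (SparkSession.builder
         .appName("text-parsing-cli")
         .getOrCreate())

# Hypothetical REPL-style loop: because the process stays alive,
# the SparkContext is initialized only once for all commands.
while True:
    command = input("> ").strip()
    if command in ("quit", "exit"):
        break
    if command.startswith("parse "):
        path = command.split(" ", 1)[1]
        # Works with HDFS paths too, e.g. hdfs://namenode:8020/data/input.txt
        df = spark.read.text(path)
        print(df.count())

spark.stop()
```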
Upvotes: 0
Views: 797
Reputation: 11
This is probably not a good idea, because your intermediate shuffle files never get cleaned up unless you explicitly call rdd.unpersist(). If the shuffle files don't get cleaned up, then over time you will start running into disk space issues on the cluster.
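
For illustration, a minimal PySpark sketch of releasing persisted data explicitly (the RDD and its transformation are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cleanup-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical RDD; once persisted, its blocks stay on the
# executors until they are explicitly released.
rdd = sc.parallelize(range(1000)).map(lambda x: (x % 10, x))
rdd.persist()

# reduceByKey triggers a shuffle, producing intermediate files.
totals = rdd.reduceByKey(lambda a, b: a + b).collect()

# Explicitly release the persisted blocks so a long-running
# context does not accumulate them across many commands.
rdd.unpersist()
spark.stop()
```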
Upvotes: 1