Reusable SparkContext instance

I'm quite new to Big Data. I'm currently working on a CLI project that performs some text parsing using Apache Spark.

When a command is typed, a new SparkContext is instantiated and some files are read from an HDFS instance. However, Spark is taking too much time to initialize a SparkContext (or even a SparkSession) object.

So, my question is: is there a way to reuse a SparkContext instance across these commands to reduce this overhead? I've heard about Spark Job Server, but it has been hard to deploy a local server because its main guide is a bit confusing.

Thank you.

P.S.: I'm using PySpark.

Upvotes: 0

Views: 797

Answers (1)

user1306140

Reputation: 11

This is probably not a good idea, because your intermediate shuffle files never get cleaned up unless you explicitly call rdd.unpersist(). If the shuffle files don't get cleaned up, then over time you will start running into disk-space issues on the cluster.

Upvotes: 1
