Reputation: 41
I have around 10 Spark jobs, each of which does some transformation and loads data into a database. A Spark session has to be opened and closed individually for each job, and the initialization takes time on every run.
Is it possible to create the Spark session only once and reuse it across multiple jobs?
Upvotes: 4
Views: 4613
Reputation: 2178
You can submit your job once, in other words run spark-submit once. Inside the submitted code you can have 10 calls, each doing some transformation and loading data into the database.
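import org.apache.spark.sql.SparkSession

// Created once; every later getOrCreate() call returns this same session.
val spark: SparkSession = SparkSession.builder
  .appName("Multiple-jobs")
  .master("<cluster name>")
  .getOrCreate()

method1()
method2()

def method1(): Unit = {
  // Returns the same Spark session that was created outside the method.
  val spark = SparkSession.builder.getOrCreate()
  // work
}

// method2() and the remaining methods are defined the same way.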
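Extending the snippet above to the ten jobs in the question, one possible layout is a single driver that runs each job in turn against one session. This is only a sketch: the `MultipleJobs` object, the `Job` type alias, and the `job1`/`job2` bodies are hypothetical placeholders, and it assumes the master is supplied by spark-submit rather than set in code.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.{Failure, Success, Try}

object MultipleJobs {
  // Hypothetical job signature: each job receives the shared session.
  type Job = SparkSession => Unit

  def job1(spark: SparkSession): Unit = {
    // transformation + database load for job 1 (placeholder)
  }

  def job2(spark: SparkSession): Unit = {
    // transformation + database load for job 2 (placeholder)
  }

  def main(args: Array[String]): Unit = {
    // The session is created once for the whole application.
    // Assumption: the master is passed on the spark-submit command line.
    val spark = SparkSession.builder
      .appName("Multiple-jobs")
      .getOrCreate()

    val jobs: Seq[(String, Job)] = Seq(
      ("job1", (s: SparkSession) => job1(s)),
      ("job2", (s: SparkSession) => job2(s))
    )

    try {
      // Run the jobs one after another; a non-fatal failure in one job
      // is reported and does not prevent the remaining jobs from running.
      jobs.foreach { case (name, job) =>
        Try(job(spark)) match {
          case Success(_)  => println(s"$name finished")
          case Failure(ex) => println(s"$name failed: ${ex.getMessage}")
        }
      }
    } finally {
      // Stop the session once, after all jobs have run.
      spark.stop()
    }
  }
}
```

Wrapping each call in `Try` is a design choice, not a requirement: without it, the first exception would end the application and skip the jobs that follow.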
However, if a job is time-consuming, say it takes 10 minutes, then by comparison you wouldn't be spending much time creating separate Spark sessions. I wouldn't worry about one Spark session per job. I would worry if a separate Spark session were created per method or per unit test case; that is where I would cut down on Spark sessions.
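For the unit-test case, one way to avoid a session per test is to keep a single lazily created local session and hand it to every suite. The sketch below is an assumption about how that could look: `TestSpark` and `SharedLocalSpark` are hypothetical names, `local[2]` is an arbitrary choice of local master, and no particular test framework is assumed.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical holder for one session shared by all tests in the JVM.
object TestSpark {
  // lazy val: the session is built on first access and reused afterwards.
  lazy val spark: SparkSession = SparkSession.builder
    .appName("unit-tests")
    .master("local[2]") // assumption: tests run against a local master
    .getOrCreate()
}

// Hypothetical trait to mix into test suites that need the session.
trait SharedLocalSpark {
  protected def spark: SparkSession = TestSpark.spark
}
```

A suite that mixes in `SharedLocalSpark` simply refers to `spark`; the session is deliberately not stopped between suites so that later suites can reuse it.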
Upvotes: 0
Reputation: 311
Technically, if you use a single Spark session you will end up with a single Spark application, because you will have to package and run multiple ETL (Extract, Transform, and Load) jobs within a single JAR file.
If you are running those jobs on a production cluster, you are most likely using spark-submit to execute your application JAR, which has to go through the initialization phase every time you submit a job through Spark Master -> Workers in client mode.
In general, a long-running Spark session is mostly suitable for prototyping, troubleshooting, and debugging. For example, a single Spark session can be leveraged in spark-shell or any other interactive development environment, such as Zeppelin, but not with spark-submit, as far as I know.
All in all, a few design/business questions are worth considering here: will merging multiple ETL jobs produce code that is easy to maintain, manage, and debug? Does it provide the required performance gain? What does a risk/cost analysis say?
Hope this helps.
Upvotes: 3