Reputation: 469
I don't understand how the Spark session/context lifecycle works. The documentation says that you can have multiple SparkSessions that share an underlying SparkContext. But how/when are those created and destroyed? For example, if I have a production cluster and I spark-submit 10 ETLs, will these 10 jobs share the same SparkContext? Does it matter if I do this in cluster/client mode?
To the best of my understanding, the SparkContext lives in the driver so I assume the above would result in one SparkContext shared by 10 SparkSessions, but I'm not at all sure I got this correctly... Any clarification will be much appreciated.
Upvotes: 3
Views: 1677
Reputation: 1853
Let's first look at SparkSession and SparkContext.
SparkContext is the channel through which all Spark functionality is accessed. The Spark driver program uses it to connect to the cluster manager, to submit Spark jobs, and to know which resource manager (for example, YARN) to talk to. Through the SparkContext, the driver can also access the other contexts, such as SQLContext, HiveContext, and StreamingContext, to program Spark.
Since Spark 2.0, SparkSession exposes all of the functionality mentioned above through a single, unified point of entry.
In other words, SparkSession encapsulates SparkContext.
Let's say multiple users access the same notebook, which has a shared SparkContext, and the requirement is to give each user an isolated environment on top of that same context. Prior to 2.0, the only way to get that isolation was to create multiple SparkContexts, one per isolated environment or user, which is an expensive operation (and only a single SparkContext can exist per JVM). The introduction of SparkSession addressed this issue.
If I spark-submit 10 ETLs, will these 10 jobs share the same SparkContext? Does it matter if I do this in cluster/client mode? To the best of my understanding, the SparkContext lives in the driver so I assume the above would result in one SparkContext shared by 10 SparkSessions...
If you submit 10 ETL jobs with spark-submit, then regardless of cluster or client mode they are all separate applications, and each one has its own SparkContext and SparkSession. In native Spark you can't share objects between different applications; if you want to share objects you have to use a shared context (for example, spark-jobserver). Other options are available as well, such as Apache Livy and Apache Ignite.
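To make the "one application, one context" point concrete, here is a minimal sketch. The object name, the app name and the `local[*]` master are assumptions for illustration only (a real spark-submit would normally supply the master). Inside a single driver JVM, a second `getOrCreate()` hands back the existing session, and `sparkContext.applicationId` identifies the application that owns the context.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark.
object OneContextPerApplication {
  // Returns (was the session reused within this JVM?, id of this application).
  def describe(): (Boolean, String) = {
    val first = SparkSession.builder()
      .master("local[*]") // assumption: local mode, only so the sketch runs without a cluster
      .appName("one-context-per-application")
      .getOrCreate()

    // A second getOrCreate() in the same driver returns the existing session.
    val second = SparkSession.builder().getOrCreate()

    val result = (first eq second, first.sparkContext.applicationId)
    first.stop()
    result
  }

  def main(args: Array[String]): Unit = {
    val (reused, appId) = describe()
    println(s"session reused within this JVM: $reused, applicationId: $appId")
  }
}
```

If you packaged this and ran it through spark-submit 10 times, each run would start its own driver and would be expected to print its own applicationId, since nothing is shared between the runs.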
Upvotes: 3
Reputation: 2214
There is one SparkContext per Spark application.
The documentation says that you can have multiple SparkSessions that share an underlying SparkContext. But how/when are those created and destroyed?
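import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
// `master` and `appName` are assumed to be defined by the caller.
val conf: SparkConf = new SparkConf()
  .setMaster(master)
  .setAppName(appName)
  .set("spark.ui.enabled", "false")
// getOrCreate() returns the existing session if there is one, otherwise it creates the session and its SparkContext.
val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()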
If you have an existing Spark session and want to create a new one, use the newSession method on the existing SparkSession.
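import org.apache.spark.sql.SparkSession
// `ss` is the existing session created above; each call returns a new session on the same SparkContext.
val newSparkSession1: SparkSession = ss.newSession()
val newSparkSession2: SparkSession = ss.newSession()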
The newSession method creates a new Spark session with isolated SQL configurations, temporary tables, and registered functions. The new session shares the underlying SparkContext and cached data with the session it was created from.
You can then use these different sessions to submit different jobs/SQL queries.
newSparkSession1.sql("<ETL-1>")
newSparkSession2.sql("<ETL-2>")
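A small sketch of what that isolation looks like in practice. The object name, the view name `numbers` and the `local[*]` master are made up for the example, so treat it as an illustration rather than a recipe: a temporary view registered in one session is not visible in its sibling, while both sessions point at the very same SparkContext.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark.
object SessionIsolationSketch {
  final case class Observed(sharedContext: Boolean, viewInFirst: Boolean, viewInSecond: Boolean)

  def observe(): Observed = {
    val base = SparkSession.builder()
      .master("local[*]") // assumption: local mode, only so the sketch runs without a cluster
      .appName("session-isolation-sketch")
      .getOrCreate()

    val first = base.newSession()
    val second = base.newSession()

    // Register a temporary view in the first session only.
    first.range(3).createOrReplaceTempView("numbers")

    val observed = Observed(
      sharedContext = first.sparkContext eq second.sparkContext,
      viewInFirst = first.catalog.tableExists("numbers"),
      viewInSecond = second.catalog.tableExists("numbers")
    )
    base.stop()
    observed
  }

  def main(args: Array[String]): Unit =
    println(observe())
}
```

The same separation applies to settings changed through `conf.set` on one of the sessions, which is why two ETLs running in sibling sessions do not step on each other's temporary views or SQL options.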
Does it matter if I do this in cluster/client mode?
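Client/cluster mode doesn't matter. The deploy mode only decides where the driver runs (on the submitting machine in client mode, inside the cluster in cluster mode); either way, each application has one driver with one SparkContext.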
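As for when things are destroyed: in either mode the context lives until the driver calls stop() or the driver JVM exits. Because sessions created with newSession share one context, stopping any of them stops the context for all of them. A minimal sketch, again assuming a `local[*]` master and made-up names:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark.
object SharedContextTeardown {
  // Returns (context stopped before child.stop()?, context stopped after child.stop()?).
  def stopThroughChild(): (Boolean, Boolean) = {
    val base = SparkSession.builder()
      .master("local[*]") // assumption: local mode, only so the sketch runs without a cluster
      .appName("shared-context-teardown")
      .getOrCreate()
    val child = base.newSession()

    val stoppedBefore = base.sparkContext.isStopped
    // Stopping the child session stops the SparkContext it shares with `base`.
    child.stop()
    val stoppedAfter = base.sparkContext.isStopped

    (stoppedBefore, stoppedAfter)
  }

  def main(args: Array[String]): Unit =
    println(stopThroughChild())
}
```

So if several ETLs run in sibling sessions inside one application, it is safer to call stop() once, after all of them have finished, rather than at the end of each ETL.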
Upvotes: 2