WestCoastProjects

Reputation: 63062

Has the limitation of a single SparkContext actually been lifted in Spark 2.0?

There has been plenty of chatter about Spark 2.0 supporting multiple SparkContexts. A configuration variable to support it has been around for much longer, but it has not actually been effective.

In $SPARK_HOME/conf/spark-defaults.conf:

spark.driver.allowMultipleContexts true

Let's verify that the property is recognized:

scala>     println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")
allowMultiCtx = true
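
For reference, the same flag can also be set programmatically on the SparkConf passed to the SparkContext constructor (a minimal sketch only; the conf and app name below are mine, not from the actual setup):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical illustration: the flag is an ordinary conf entry
val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("allow-multi-ctx-check")
  .set("spark.driver.allowMultipleContexts", "true")
val sc2 = new SparkContext(conf)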

Here is a small proof-of-concept program for it:

import org.apache.spark._
import org.apache.spark.streaming._

println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")

// Create a separate SparkContext/StreamingContext per input directory
def createAndStartFileStream(dir: String) = {
  val sc = new SparkContext("local[1]", s"Spark-$dir" /*,conf*/)
  val ssc = new StreamingContext(sc, Seconds(4))
  val dstream = ssc.textFileStream(dir)
  val valuesCounts = dstream.countByValue()
  ssc.start()
  ssc.awaitTermination()
}

val dirs = Seq("data10m", "data50m", "dataSmall").map { d =>
  s"/shared/demo/data/$d"
}
dirs.foreach { d =>
  createAndStartFileStream(d)
}

However, attempts to use that capability are not succeeding:

16/08/14 11:38:55 WARN SparkContext: Multiple running SparkContexts detected 
in the same JVM!
org.apache.spark.SparkException: Only one SparkContext may be running in
this JVM (see SPARK-2243). To ignore this error, 
set spark.driver.allowMultipleContexts = true. 
The currently running SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:814)
org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)

Does anyone have any insight on how to use multiple contexts?

Upvotes: 1

Views: 524

Answers (1)

WestCoastProjects

Reputation: 63062

Per @LostInOverflow, this feature will not be fixed. Here is the info from that JIRA:

SPARK-2243 Support multiple SparkContexts in the same JVM

https://issues.apache.org/jira/browse/SPARK-2243

Sean Owen added a comment - 16/Jan/16 17:35

You say you're concerned with over-utilizing a cluster for steps that don't require much resource. This is what dynamic allocation is for: the number of executors increases and decreases with load. If one context is already using all cluster resources, yes, that doesn't do anything. But then, neither does a second context; the cluster is already fully used.

I don't know what overhead you're referring to, but certainly one context running N jobs is busier than N contexts running N jobs. Its overhead is higher, but the total overhead is lower. This is more an effect than a cause that would make you choose one architecture over another.

Generally, Spark has always assumed one context per JVM and I don't see that changing, which is why I finally closed this. I don't see any support for making this happen.
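
For completeness, the one-context-running-N-jobs approach described above would look roughly like this (a sketch only, reusing the shell's existing sc and the same placeholder paths as in the question):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Single StreamingContext built on the one existing SparkContext
val ssc = new StreamingContext(sc, Seconds(4))

val dirs = Seq("data10m", "data50m", "dataSmall").map(d => s"/shared/demo/data/$d")

// Register one file stream per directory -- all driven by the same context
dirs.foreach { dir =>
  ssc.textFileStream(dir).countByValue().print()
}

ssc.start()
ssc.awaitTermination()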

Upvotes: 3
