Tianyi Wang

Reputation: 197

Spark/Scala RDD join always gives me empty result

I have been playing with Spark and found that my join operation doesn't work. Below is part of my code and the result from the Scala console:

scala> val conf = new SparkConf().setMaster("local[*]").setAppName("Part4")
scala> val sc = new SparkContext(conf)


scala> val k1 = sc.parallelize(List((1,3),(1,5),(2,4)))
k1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:29

scala> val k2 = sc.parallelize(List((1,'A'),(2,'B')))
k2: org.apache.spark.rdd.RDD[(Int, Char)] = ParallelCollectionRDD[25] at parallelize at <console>:29

scala> val k3 = k1.join(k2)
k3: org.apache.spark.rdd.RDD[(Int, (Int, Char))] = MapPartitionsRDD[28] at join at <console>:33

scala> k3.foreach(println)

scala> k3.collect
res33: Array[(Int, (Int, Char))] = Array()

So I just created a toy example with two RDDs, k1 and k2, of (k, v) pairs and tried to join them. However, the result k3 is always empty. We can see that k1 and k2 are created correctly, but k3 is empty nevertheless.

What is wrong?

------- Update to my question: I think I know where the problem is, but I'm still confused:

At first I wrote

 val conf = new SparkConf().setMaster("local[*]").setAppName("Part4")
 val sc = new SparkContext(conf)

When I didn't have those two lines of code, my join worked, but once I added them it stopped working. Why was that? For reference, the version that worked for me looked roughly like the sketch below, which assumes a plain spark-shell session where the shell's own sc is used directly.
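
// Minimal sketch of the working version, assuming a stock spark-shell
// session where the shell has already created the SparkContext as `sc`
// (no `new SparkConf` / `new SparkContext` lines of my own).
val k1 = sc.parallelize(List((1, 3), (1, 5), (2, 4)))
val k2 = sc.parallelize(List((1, 'A'), (2, 'B')))

val k3 = k1.join(k2)            // inner join on the Int key
k3.collect().foreach(println)   // expected, in some order: (1,(3,A)), (1,(5,A)), (2,(4,B))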

Upvotes: 0

Views: 1015

Answers (1)

Michael Lloyd Lee mlk

Reputation: 14661

spark-shell starts up its own SparkContext. Unfortunately, Spark does not like multiple contexts running in the same application. When I execute the second line (val sc = new SparkContext(conf)) in spark-shell, I get:

SNIP LOTS OF ERROR LINES
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
SNIP LOTS OF ERROR LINES

Spark keeps a lot of static state, which means it does not work well when you have two contexts. I'd chalk it up to that, though alas I cannot prove it.
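
In other words: inside spark-shell, just reuse the sc the shell has already created; only build your own context when you run a standalone application via spark-submit. A rough sketch of that standalone version is below (the object name Part4 simply reuses the app name from the question; the second import is only needed on Spark versions before 1.3 for the pair-RDD implicits):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required on Spark < 1.3

object Part4 {
  def main(args: Array[String]): Unit = {
    // In a standalone app we are the only ones creating a context, so this is fine.
    val conf = new SparkConf().setMaster("local[*]").setAppName("Part4")
    val sc   = new SparkContext(conf)
    try {
      val k1 = sc.parallelize(List((1, 3), (1, 5), (2, 4)))
      val k2 = sc.parallelize(List((1, 'A'), (2, 'B')))
      // Inner join on the Int key, as in the question.
      k1.join(k2).collect().foreach(println)
    } finally {
      sc.stop() // release the context when the job is done
    }
  }
}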

Upvotes: 1
