Al Jenssen

Reputation: 695

Apache Spark - Intersection of Multiple RDDs

In Apache Spark one can union multiple RDDs efficiently by using the sparkContext.union() method. Is there something similar for intersecting multiple RDDs? I have searched the sparkContext methods and elsewhere, and I could not find anything. One solution could be to union the RDDs and then retrieve the duplicates, but I do not think that would be very efficient. Assume I have the following example with key/value pair collections:

val rdd1 = sc.parallelize(Seq((1,1.0),(2,1.0)))
val rdd2 = sc.parallelize(Seq((1,2.0),(3,4.0),(3,1.0)))

I want to retrieve a new collection which has the following elements:

(1,2.0) (1,1.0)

But of course for multiple RDDs, not just two.
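To make the union idea concrete, a rough sketch of it for the key/value case might look like this (rdds, commonKeys, and result are made-up names; a key counts as common when it occurs in all n RDDs):

val rdds = Seq(rdd1, rdd2)
val n = rdds.length
// Count in how many RDDs each distinct key occurs; keep keys seen in all of them.
val commonKeys = sc.union(rdds.map(_.keys.distinct()))
  .map(k => (k, 1))
  .reduceByKey(_ + _)
  .filter { case (_, count) => count == n }
  .keys
  .map(k => (k, ()))
// Recover the original (key, value) pairs for those keys.
val result = rdds.map(_.join(commonKeys).mapValues { case (v, _) => v }).reduce(_ union _)

This is exactly the shuffle-heavy kind of chain I suspect is not that efficient.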

Upvotes: 1

Views: 1968

Answers (2)

Tim

Reputation: 3725

There is an intersection method on RDD, but it only takes one other RDD:

def intersection(other: RDD[T]): RDD[T]
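For example, a quick illustrative check with made-up values:

val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(2, 3, 4))
a.intersection(b).collect()  // Array(2, 3), in some order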

Let's implement the method you want in terms of this one.

def intersectRDDs[T](rdds: Seq[RDD[T]]): RDD[T] = {
  rdds.reduce { case (left, right) => left.intersection(right) }
}

If you've looked at the implementation of Spark joins, you can optimize the execution by putting the largest RDD first:

def intersectRDDs[T](rdds: Seq[RDD[T]]): RDD[T] = {
  rdds.sortBy(rdd => -1 * rdd.partitions.length)
    .reduce { case (left, right) => left.intersection(right) }
}
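As a quick sanity check of the helper, with made-up values:

val a = sc.parallelize(Seq(1, 2, 3, 4))
val b = sc.parallelize(Seq(2, 3, 4))
val c = sc.parallelize(Seq(3, 4, 5))
intersectRDDs(Seq(a, b, c)).collect()  // Array(3, 4), in some order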

EDIT: It looks like I misread your example: your text suggested you were searching for the intersection counterpart of rdd.union, but your example implies you want to intersect by key. My answer does not address that case; a sketch of that variant follows below.
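For completeness, one possible sketch of that by-key variant (intersectByKey is a made-up helper, not a Spark API; it assumes "intersect by key" means keeping every pair whose key occurs in all of the RDDs):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def intersectByKey[K: ClassTag, V: ClassTag](rdds: Seq[RDD[(K, V)]]): RDD[(K, V)] = {
  // Keys present in every RDD, tagged with Unit so they can be joined against.
  val commonKeys = rdds
    .map(_.keys.distinct())
    .reduce(_ intersection _)
    .map(k => (k, ()))
  // Keep the original pairs whose key is one of the common keys.
  rdds
    .map(_.join(commonKeys).mapValues { case (v, _) => v })
    .reduce(_ union _)
}

With the question's rdd1 and rdd2, intersectByKey(Seq(rdd1, rdd2)).collect() should yield Array((1,1.0), (1,2.0)) in some order.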

Upvotes: 2

user6022341

Reputation:

Try:

val rdds = Seq(
  sc.parallelize(Seq(1, 3, 5)),
  sc.parallelize(Seq(3, 5)),
  sc.parallelize(Seq(1, 3))
)
rdds
  .map(rdd => rdd.map(x => (x, None)))                  // tag elements so each RDD becomes a pair RDD
  .reduce((x, y) => x.join(y).keys.map(k => (k, None))) // a join keeps only keys present on both sides
  .keys                                                 // drop the dummy None tags
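The same trick recovers the keys common to all of the question's pair RDDs if each one is first keyed on its first element (a sketch; pairRdds is a made-up name):

val pairRdds = Seq(rdd1, rdd2)
pairRdds
  .map(_.map { case (k, _) => (k, None) })
  .reduce((x, y) => x.join(y).keys.map(k => (k, None)))
  .keys  // the common keys: here just 1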

Upvotes: 2
