nilesh1212

Reputation: 1655

Array[RDD[(String, Set[String])]] transformation in Spark Scala

I have an array of RDDs of type Array[RDD[(String, Set[String])]], where each RDD is a tuple of key and value. The key is a String and the value is a Set[String]. I want to merge/union the Sets that have the same key. I am trying to do this in Scala but with no joy. Can you please help me out?

e.g.
RDD["A",Set("1","2")]
RDD["A",Set("3","4")]
RDD["B",Set("1","2")]
RDD["B",Set("3","4")]
RDD["C",Set("1","2")]
RDD["C",Set("3","4")]

After transformation:
RDD["A",Set("1","2","3","4")]
RDD["B",Set("1","2","3","4")]
RDD["C",Set("1","2","3","4")]

Upvotes: 0

Views: 2009

Answers (1)

Mateusz Dymczyk

Reputation: 15141

If a single RDD as output is OK (I don't really see any reason to make many RDDs with only one record in each), you can reduce your Array of RDDs into a single RDD and then do a groupByKey:

arr.reduce( _ ++ _ )
   .groupByKey
   .mapValues(_.flatMap(identity))

Example:

scala> val x = sc.parallelize( List( ("A", Set(1,2)) ) )
scala> val x2 = sc.parallelize( List( ("A", Set(3,4)) ) )
scala> val arr = Array(x,x2)
arr: Array[org.apache.spark.rdd.RDD[(String, scala.collection.immutable.Set[Int])]] = Array(ParallelCollectionRDD[0] at parallelize at <console>:27, ParallelCollectionRDD[1] at parallelize at <console>:27)
scala> arr.reduce( _ ++ _ ).groupByKey.mapValues(_.flatMap(identity)).foreach(println)
(A,List(1, 2, 3, 4))
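
If you'd rather keep the values typed as a Set instead of the Iterable that groupByKey plus mapValues gives you, a reduceByKey variant along the same lines should also work (a sketch, not tested beyond the shape above); it merges values map-side before the shuffle, which generally performs better than groupByKey:

// Union the two Sets for each key directly, preserving the Set value type
arr.reduce( _ ++ _ )
   .reduceByKey( _ ++ _ )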

@Edit: I find it to be a really bad idea and would advise you to rethink it, but you can get the result output you want by getting all the keys from above and filtering the RDD multiple times:

val sub = arr.reduce( _ ++ _ ).groupByKey.mapValues(_.flatMap(identity))
val keys = sub.map(_._1).collect()
val result = for(k <- keys) yield sub.filter(_._1 == k)
result: Array[org.apache.spark.rdd.RDD[(String, Iterable[Int])]]

Each RDD will have a single tuple; I don't really think it's very useful or performant.

Upvotes: 2
