nilesh1212

Reputation: 1655

Array[RDD[(String, Set[String])]] transformation in Spark Scala

I have an array of RDDs of type Array[RDD[(String, Set[String])]], where each RDD is a tuple of key and value. The key is a String and the value is a Set[String]. I want to merge/union the Sets that have the same key. I am trying to do this in Scala but with no joy. Can you please help me out?

e.g.
RDD["A",Set("1","2")]
RDD["A",Set("3","4")]
RDD["B",Set("1","2")]
RDD["B",Set("3","4")]
RDD["C",Set("1","2")]
RDD["C",Set("3","4")]

After transformation:
RDD["A",Set("1","2","3","4")]
RDD["B",Set("1","2","3","4")]
RDD["C",Set("1","2","3","4")]

Upvotes: 0

Views: 2009

Answers (1)

Mateusz Dymczyk

Reputation: 15141

If a single RDD as output is OK (I don't really see any reason to make many RDDs with only one record in each), you can reduce your Array of RDDs into a single RDD and then do a groupByKey:

arr.reduce( _ ++ _ )
   .groupByKey
   .mapValues(_.flatMap(identity))

Example:

scala> val x = sc.parallelize( List( ("A", Set(1,2)) ) )
scala> val x2 = sc.parallelize( List( ("A", Set(3,4)) ) )
scala> val arr = Array(x,x2)
arr: Array[org.apache.spark.rdd.RDD[(String, scala.collection.immutable.Set[Int])]] = Array(ParallelCollectionRDD[0] at parallelize at <console>:27, ParallelCollectionRDD[1] at parallelize at <console>:27)
scala> arr.reduce( _ ++ _ ).groupByKey.mapValues(_.flatMap(identity)).foreach(println)
(A,List(1, 2, 3, 4))
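
If you'd rather keep the values typed as a Set instead of the Iterable that groupByKey plus mapValues gives you, a reduceByKey variant along the same lines should also work (a sketch, not tested beyond the shape above); it merges values map-side before the shuffle, which generally performs better than groupByKey:

// Union the two Sets for each key directly, preserving the Set value type
arr.reduce( _ ++ _ )
   .reduceByKey( _ ++ _ )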

@Edit: I find it to be a really bad idea and would advise you to rethink it, but you can get the result output you want by getting all the keys from above and filtering the RDD multiple times:

val sub = arr.reduce( _ ++ _ ).groupByKey.mapValues(_.flatMap(identity))
val keys = sub.map(_._1).collect()
val result = for(k <- keys) yield sub.filter(_._1 == k)
result: Array[org.apache.spark.rdd.RDD[(String, Iterable[Int])]]

Each RDD will have a single tuple; I don't really think it's very useful or performant.

Upvotes: 2
