Reputation: 1655
I have an array of RDDs of type Array[RDD[(String, Set[String])]], where each RDD contains tuples of a key and a value. The key is a String and the value is a Set[String]. I want to merge/union the sets that share the same key. I am trying to do this in Scala but with no joy. Can you please help me out?
e.g.
RDD["A",Set("1","2")]
RDD["A",Set("3","4")]
RDD["B",Set("1","2")]
RDD["B",Set("3","4")]
RDD["C",Set("1","2")]
RDD["C",Set("3","4")]
After transformation:
RDD["A",Set("1","2","3","4")]
RDD["B",Set("1","2","3","4")]
RDD["C",Set("1","2","3","4")]
Upvotes: 0
Views: 2009
Reputation: 15141
If a single RDD as output is OK (I don't really see any reason to make many RDDs with only one record in them), you can reduce your Array of RDDs into a single RDD and then do a groupByKey:
arr.reduce( _ ++ _ )
.groupByKey
.mapValues(_.flatMap(identity))
Example:
scala> val x = sc.parallelize( List( ("A", Set(1,2)) ) )
scala> val x2 = sc.parallelize( List( ("A", Set(3,4)) ) )
scala> val arr = Array(x,x2)
arr: Array[org.apache.spark.rdd.RDD[(String, scala.collection.immutable.Set[Int])]] = Array(ParallelCollectionRDD[0] at parallelize at <console>:27, ParallelCollectionRDD[1] at parallelize at <console>:27)
scala> arr.reduce( _ ++ _ ).groupByKey.mapValues(_.flatMap(identity)).foreach(println)
(A,List(1, 2, 3, 4))
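As a quick sanity check, the same key-wise set union can be sketched on plain Scala collections, with no Spark needed (the `parts` and `merged` names here are just for illustration); `groupBy` plays the role of `groupByKey` and the flatten/union step mirrors the `mapValues` above:

```scala
// Sample data mirroring the question: repeated keys, each with a partial set.
val parts = Seq(
  ("A", Set("1", "2")), ("A", Set("3", "4")),
  ("B", Set("1", "2")), ("B", Set("3", "4")),
  ("C", Set("1", "2")), ("C", Set("3", "4"))
)

// Group tuples by key, then union all the sets under each key.
val merged: Map[String, Set[String]] =
  parts
    .groupBy(_._1)
    .map { case (k, vs) => k -> vs.flatMap(_._2).toSet }

println(merged("A"))  // Set(1, 2, 3, 4)
```

On an actual RDD of sets, `reduceByKey(_ ++ _)` would achieve the same union while combining sets on the map side, which generally scales better than `groupByKey`.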
@Edit: I find it to be a really bad idea and would advise you to rethink it, but you can get the output you want by collecting all the keys from above and filtering the RDD multiple times:
val sub = arr.reduce( _ ++ _ ).groupByKey.mapValues(_.flatMap(identity))
val keys = sub.map(_._1).collect()
val result = for(k <- keys) yield sub.filter(_._1 == k)
result: Array[org.apache.spark.rdd.RDD[(String, Iterable[Int])]]
Each RDD will contain a single tuple; I don't really think that's very useful or performant.
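To illustrate why each resulting RDD holds exactly one element, here is the same per-key filtering mimicked on plain collections (names are illustrative): after the merge there is one tuple per key, so filtering by a key can only ever match that one tuple:

```scala
// Already-merged data: one tuple per key.
val sub = Seq(("A", Set(1, 2, 3, 4)), ("B", Set(1, 2, 3, 4)))

// One filtered collection per key, as in the answer's for-comprehension.
val keys = sub.map(_._1).distinct
val result = for (k <- keys) yield sub.filter(_._1 == k)

println(result.map(_.size))  // every split contains exactly one tuple
```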
Upvotes: 2