Reputation: 3399
I have a RDD with these elements:
("a", Array(1, 2)) | ("b", Array(3, 4)) | ("c", Array(1, 2))
I wish to group it using the array in order to have this:
(Array("a", "c"), Array(1, 2)) | (Array("b"), Array(3, 4))
How to do that (preferably in Scala)?
Upvotes: 1
Views: 250
Reputation: 37842
Since you can't use arrays as keys using Spark's default partitioner, you'll have to group by the arrays converted to lists, then just map the results back to the structure you're after:
val input: RDD[(String, Array[Int])] = ???
val result: RDD[(Array[String], Array[Int])] = input
.groupBy(_._2.toList) // group by array
.values // keep values only, of type Iterable[(String, Array[Int])]
.map(it => (it.map(_._1).toArray, it.head._2)) // map back to requested format
Upvotes: 2