Armand Grillet
Armand Grillet

Reputation: 3399

Grouping a RDD using an array

I have a RDD with these elements:

("a", Array(1, 2)) | ("b", Array(3, 4)) | ("c", Array(1, 2))

I wish to group it using the array in order to have this:

(Array("a", "c"), Array(1, 2)) | (Array("b"), Array(3, 4))

How to do that (preferably in Scala)?

Upvotes: 1

Views: 250

Answers (1)

Tzach Zohar
Tzach Zohar

Reputation: 37842

Since you can't use arrays as keys using Spark's default partitioner, you'll have to group by the arrays converted to lists, then just map the results back to the structure you're after:

val input: RDD[(String, Array[Int])] = ???

val result: RDD[(Array[String], Array[Int])] = input
  .groupBy(_._2.toList) // group by array
  .values // keep values only, of type Iterable[(String, Array[Int])]
  .map(it => (it.map(_._1).toArray, it.head._2)) // map back to requested format

Upvotes: 2

Related Questions