Spar

Reputation: 483

How to group by multiple keys in a Spark RDD?

Imagine I have an RDD of triplets (built here with sc.parallelize for illustration):

val RecordRDD: RDD[(Int, String, Int)] = sc.parallelize(Seq(
  (5, "x1", 100),
  (3, "x2", 200),
  (3, "x4", 300),
  (5, "x1", 150),
  (3, "x2", 160),
  (5, "x1", 400)
))

How can I efficiently group them by the first two elements and sort the grouped values by the third? For example, turning it into:

                [5, ["x1" -> [100, 150, 400]]]
                [3, ["x2" -> [160, 200], "x4" -> [300]]]

I am looking for an efficient way.

Should I convert it to a DataFrame and use groupBy(col1, col2) together with a sort on col3?

Would that be more efficient than the RDD's groupBy?
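For reference, this is roughly the DataFrame version I have in mind (a sketch only; it assumes a SparkSession named spark for toDF, and the column names col1/col2/col3 are placeholders):

import spark.implicits._
import org.apache.spark.sql.functions.{collect_list, sort_array}

val grouped = RecordRDD.toDF("col1", "col2", "col3")
  .groupBy("col1", "col2")
  .agg(sort_array(collect_list("col3")).as("col3s"))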

Can aggregateByKey aggregate on two keys simultaneously?

You can assume this RDD is pretty large! Thanks in advance.

Upvotes: 2

Views: 8072

Answers (1)

Glennie Helles Sindholt

Reputation: 13154

You didn't mention which version of Spark you are running, but one way of doing this with RDDs is like this:

val result = RecordRDD
  // key by the first two fields, wrapping each value in a one-element list
  .map { case (x, y, z) => ((x, y), List(z)) }
  // concatenate the lists for each (Int, String) key
  .reduceByKey(_ ++ _)
  // re-key by the first field only; sort each list and wrap it in a map
  .map { case ((x, y), list) => (x, Map(y -> list.sorted)) }
  // merge the maps that share the same first key
  .reduceByKey(_ ++ _)
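As a sanity check on the sample data above, collecting should print something along these lines (the order of the two top-level records is not deterministic):

result.collect().foreach(println)
// (5,Map(x1 -> List(100, 150, 400)))
// (3,Map(x2 -> List(160, 200), x4 -> List(300)))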

I don't know if it is the most efficient way, but it is pretty efficient ;)
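On the aggregateByKey question: it only ever sees a single key, but that key can be the (Int, String) pair itself, so the same shape works. A sketch along those lines (not benchmarked; it avoids allocating a one-element list per record by prepending to an accumulator list instead):

val result2 = RecordRDD
  .map { case (x, y, z) => ((x, y), z) }
  // build one list per (Int, String) key: seqOp prepends, combOp concatenates
  .aggregateByKey(List.empty[Int])((acc, v) => v :: acc, _ ++ _)
  .map { case ((x, y), list) => (x, Map(y -> list.sorted)) }
  .reduceByKey(_ ++ _)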

Upvotes: 5
