Spar

Reputation: 483

How to group by multiple keys in a Spark RDD?

Imagine I have an RDD of triplets (built here with sc.parallelize for illustration):

val RecordRDD: RDD[(Int, String, Int)] = sc.parallelize(Seq(
  (5, "x1", 100),
  (3, "x2", 200),
  (3, "x4", 300),
  (5, "x1", 150),
  (3, "x2", 160),
  (5, "x1", 400)
))

How can I efficiently group them by the first two elements and sort the grouped values by the third? For example, turning it into:

                [5, ["x1" -> [100, 150, 400]]]
                [3, ["x2" -> [160, 200], "x4" -> [300]]]

I am looking for an efficient way.

Should I convert it to a DataFrame and use groupBy(col1, col2) together with a sort on col3?

Would that be more efficient than the RDD's groupBy?
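For reference, this is roughly the DataFrame version I have in mind (a sketch only; it assumes a SparkSession named spark for toDF, and the column names col1/col2/col3 are placeholders):

import spark.implicits._
import org.apache.spark.sql.functions.{collect_list, sort_array}

val grouped = RecordRDD.toDF("col1", "col2", "col3")
  .groupBy("col1", "col2")
  .agg(sort_array(collect_list("col3")).as("col3s"))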

Can aggregateByKey aggregate on two keys simultaneously?

You can assume this RDD is pretty large! Thanks in advance.

Upvotes: 2

Views: 8072

Answers (1)

Glennie Helles Sindholt

Reputation: 13154

You didn't mention which version of Spark you are running, but one way of doing this with RDDs is like this:

val result = RecordRDD
  // key by the first two fields, wrapping each value in a one-element list
  .map { case (x, y, z) => ((x, y), List(z)) }
  // concatenate the lists for each (Int, String) key
  .reduceByKey(_ ++ _)
  // re-key by the first field only; sort each list and wrap it in a map
  .map { case ((x, y), list) => (x, Map(y -> list.sorted)) }
  // merge the maps that share the same first key
  .reduceByKey(_ ++ _)
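As a sanity check on the sample data above, collecting should print something along these lines (the order of the two top-level records is not deterministic):

result.collect().foreach(println)
// (5,Map(x1 -> List(100, 150, 400)))
// (3,Map(x2 -> List(160, 200), x4 -> List(300)))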

I don't know if it is the most efficient way, but it is pretty efficient ;)
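On the aggregateByKey question: it only ever sees a single key, but that key can be the (Int, String) pair itself, so the same shape works. A sketch along those lines (not benchmarked; it avoids allocating a one-element list per record by prepending to an accumulator list instead):

val result2 = RecordRDD
  .map { case (x, y, z) => ((x, y), z) }
  // build one list per (Int, String) key: seqOp prepends, combOp concatenates
  .aggregateByKey(List.empty[Int])((acc, v) => v :: acc, _ ++ _)
  .map { case ((x, y), list) => (x, Map(y -> list.sorted)) }
  .reduceByKey(_ ++ _)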

Upvotes: 5
