Reputation: 483
Imagine I have an RDD of triplets:
val RecordRDD: RDD[(Int, String, Int)] = sc.parallelize(Seq(
  (5, "x1", 100),
  (3, "x2", 200),
  (3, "x4", 300),
  (5, "x1", 150),
  (3, "x2", 160),
  (5, "x1", 400)
))
How can I efficiently group them by the first two elements and sort by the third? For example, the result should look like:
[5, ["x1" -> [100, 150, 400]]]
[3, ["x2" -> [160, 200], "x4" -> [300]]]
I am looking for an efficient way.
Should I convert it to a DataFrame and use groupBy(col1, col2) together with a sort on col3?
Would that be more efficient than Spark RDD's groupBy?
Can aggregateByKey aggregate on two keys simultaneously?
*You can assume this RDD is pretty large! Thanks in advance.
Upvotes: 2
Views: 8072
Reputation: 13154
You didn't mention which version of Spark you are running, but one way of doing this with RDDs is like this:
val result = RecordRDD
  .map { case (x, y, z) => ((x, y), List(z)) }                // key by (Col1, Col2)
  .reduceByKey(_ ++ _)                                        // gather all Col3 values per key
  .map { case ((x, y), list) => (x, Map(y -> list.sorted)) }  // sort values, re-key by Col1 only
  .reduceByKey(_ ++ _)                                        // merge the inner maps per Col1
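To answer the aggregateByKey question: it cannot take two keys directly, but making the key a tuple achieves the same thing. A minimal sketch of that variant, assuming the same `RecordRDD` as above (the name `viaAggregate` is mine); it avoids allocating a single-element `List` per record:

```scala
// Composite (Col1, Col2) key with aggregateByKey instead of reduceByKey.
val viaAggregate = RecordRDD
  .map { case (x, y, z) => ((x, y), z) }                 // plain Int values, tuple key
  .aggregateByKey(List.empty[Int])(
    (acc, v) => v :: acc,                                // merge a value into a partition's list
    (l1, l2) => l1 ++ l2                                 // merge lists across partitions
  )
  .map { case ((x, y), vs) => (x, Map(y -> vs.sorted)) } // sort, then re-key by Col1
  .reduceByKey(_ ++ _)                                   // combine inner maps per Col1
```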
I don't know if it is the most efficient way, but it is pretty efficient ;)
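Regarding the DataFrame route you asked about: a hedged sketch of one way to express it, assuming a `SparkSession` named `spark` is in scope (the column names `c1`/`c2`/`c3` are placeholders I chose):

```scala
// DataFrame alternative: group on both columns, collect Col3 into a sorted array.
import spark.implicits._
import org.apache.spark.sql.functions.{collect_list, sort_array}

val grouped = RecordRDD
  .toDF("c1", "c2", "c3")
  .groupBy("c1", "c2")
  .agg(sort_array(collect_list("c3")).as("values"))
```

Whether this beats the RDD version depends on your Spark version and data; with Spark 2.x the Catalyst optimizer and Tungsten generally give DataFrames the edge on large inputs, so it is worth benchmarking both on your data.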
Upvotes: 5