Reputation: 2368
Working with Apache Spark using Java. I have a JavaPairRDD<String, Long>
and I want to sort this dataset by its value. However, it seems there is only a sortByKey
method on it. How can I sort it by the value of Long
type?
Upvotes: 6
Views: 9323
Reputation: 1
JavaPairRDD<String, Long> sorted = reduce.mapToPair(tuple -> new Tuple2<Long, String>(tuple._2, tuple._1))
.sortByKey(false)
.mapToPair(tuples -> new Tuple2<>(tuples._2, tuples._1));
Swap using mapToPair, then sort, then swap again.
Upvotes: 0
Reputation: 5735
I did this using a List, which (since Java 8) has a sort(Comparator c)
method
// copy into an ArrayList so the list can be sorted in place
List<Tuple2<String, Long>> tuples = new ArrayList<>(myRdd.collect());
tuples.sort((Tuple2<String, Long> o1, Tuple2<String, Long> o2) -> o2._2.compareTo(o1._2));
It is longer than @Atul's solution and I don't know whether it performs better; on an RDD with 500 items it shows no difference, and I wonder how it behaves with a million-record RDD.
You can also use Collections.sort
and pass in the list returned by collect
together with the lambda-based Comparator.
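A minimal, self-contained sketch of that Collections.sort variant. Since Spark's Tuple2 isn't available outside a Spark project, Map.Entry stands in for it here; the pairs list is a stand-in for what myRdd.collect() would return, and the comparator mirrors the o2._2.compareTo(o1._2) descending sort above.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SortByValueDemo {
    public static void main(String[] args) {
        // Stand-in for myRdd.collect(); Map.Entry replaces Spark's Tuple2
        // so this example runs without Spark on the classpath.
        List<Map.Entry<String, Long>> pairs = new ArrayList<>(Arrays.asList(
                new AbstractMap.SimpleEntry<>("a", 3L),
                new AbstractMap.SimpleEntry<>("b", 10L),
                new AbstractMap.SimpleEntry<>("c", 7L)));

        // Descending sort by value, mirroring o2._2.compareTo(o1._2)
        Collections.sort(pairs, (o1, o2) -> o2.getValue().compareTo(o1.getValue()));

        System.out.println(pairs); // [b=10, c=7, a=3]
    }
}
```

Note that this sorts on the driver after collect(), so it only makes sense when the whole dataset fits in driver memory.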
Upvotes: 0
Reputation: 4720
dataset.mapToPair(x -> x.swap()).sortByKey(false).mapToPair(x -> x.swap()).take(100)
Upvotes: 7
Reputation: 37435
'Secondary sort' is not supported by Spark yet (see SPARK-3655 for details).
As a workaround, you can sort by value by swapping key <-> value and sorting by key as usual.
In Scala it would be something like:
val kv: RDD[(String, Long)] = ???
// swap key and value
val vk = kv.map(_.swap)
val vkSorted = vk.sortByKey()
// swap back to restore the original (String, Long) layout
val sorted = vkSorted.map(_.swap)
Upvotes: 4