chrisTina

Reputation: 2368

Spark - sort by value with a JavaPairRDD

Working with Apache Spark using Java. I have a JavaPairRDD<String,Long> and I want to sort this dataset by its value. However, it seems there is only a sortByKey method. How can I sort by the Long value instead?

Upvotes: 6

Views: 9323

Answers (4)

kaitav mehta

Reputation: 1

JavaPairRDD<String, Long> sorted = reduce.mapToPair(tuple -> new Tuple2<Long, String>(tuple._2, tuple._1))
                    .sortByKey(false)
                    .mapToPair(tuples -> new Tuple2<>(tuples._2, tuples._1));

Swap key and value using mapToPair, sort by the (now-key) value, then swap back.

Upvotes: 0

JBoy

Reputation: 5735

I did this using a List, which (since Java 8) has a sort(Comparator) method:

List<Tuple2<String, Long>> touples = new ArrayList<>();
touples.addAll(myRdd.collect());
touples.sort((Tuple2<String, Long> o1, Tuple2<String, Long> o2) -> o2._2.compareTo(o1._2));

It is longer than @Atul's solution, and I don't know whether it is better performance-wise; on an RDD with 500 items it shows no difference, and I wonder how it behaves on an RDD with a million records. You can also use Collections.sort, passing in the list returned by collect and the lambda-based Comparator.
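A minimal sketch of the Collections.sort variant mentioned above. To keep it runnable without Spark on the classpath, Map.Entry stands in for Tuple2 (the sample data and class name are illustrative, not from the original answer):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SortByValueDemo {
    public static void main(String[] args) {
        // Stand-in for myRdd.collect(): a list of (word, count) pairs.
        List<Map.Entry<String, Long>> touples = new ArrayList<>();
        touples.add(new SimpleEntry<>("a", 3L));
        touples.add(new SimpleEntry<>("b", 7L));
        touples.add(new SimpleEntry<>("c", 1L));

        // Sort descending by value, mirroring the answer's lambda comparator.
        Collections.sort(touples, (o1, o2) -> o2.getValue().compareTo(o1.getValue()));

        System.out.println(touples);  // [b=7, a=3, c=1]
    }
}
```

Note this sorts on the driver after collect, so it only makes sense when the whole dataset fits in driver memory.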

Upvotes: 0

Atul Soman

Reputation: 4720

dataset.mapToPair(x -> x.swap())   // (value, key)
       .sortByKey(false)           // descending by value
       .mapToPair(x -> x.swap())   // back to (key, value)
       .take(100)

Upvotes: 7

maasg

Reputation: 37435

'Secondary sort' is not supported by Spark yet (See SPARK-3655 for details).

As a workaround, you can sort by value by swapping key <-> value and then sorting by key as usual.

In Scala it would be something like:

val kv: RDD[(String, Long)] = ???
// swap key and value
val vk = kv.map(_.swap)
val vkSorted = vk.sortByKey()

Upvotes: 4
