Reputation: 1623
I have an RDD of 5-element tuples, e.g., RDD[(Double, String, Int, Double, Double)].
How can I sort this RDD efficiently by the fifth element?
I tried mapping the RDD into key-value pairs and calling sortByKey, but sortByKey seems quite slow: it is slower than collecting the RDD and calling sortWith on the resulting array. Why is that?
Thank you very much.
Upvotes: 6
Views: 8091
Reputation: 354
To sort in descending order when the sort field is numeric (e.g., an Int), you can negate it with "-" in the key function.
For example:
Given an RDD of (String, Int) tuples, this sorts by the second element in descending order:
rdd.sortBy(x => -x._2).collect().foreach(println)
Negation does not work for non-numeric fields. Given an RDD of (String, String) tuples, pass false as the second argument (ascending) instead:
rdd.sortBy(x => x._2, false).collect().foreach(println)
Upvotes: 3
Reputation: 8996
sortByKey is the only distributed sorting API for Spark 1.0.
How much data are you trying to sort? For a small amount, local (centralized) sorting will be faster. Spark shines when you need to sort many gigabytes of data that may not even fit on a single node.
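As a rough sketch of the two options, assuming the 5-tuple layout from the question: the names Row, sortLocally, and sortDistributed are made up for illustration, and which one is faster depends on your data size and cluster.

```scala
import org.apache.spark.rdd.RDD
// Older Spark 1.x releases need this import for the pair-RDD implicits
// (sortByKey, values); it is harmless on later versions.
import org.apache.spark.SparkContext._

object SortChoice {
  // Hypothetical alias for the tuple shape described in the question.
  type Row = (Double, String, Int, Double, Double)

  // Small data: pull everything to the driver and sort the array there.
  // Only safe when the whole RDD fits in driver memory.
  def sortLocally(rdd: RDD[Row]): Array[Row] =
    rdd.collect().sortWith(_._5 < _._5)

  // Large data: key each row by its fifth field, sort across the cluster,
  // then drop the key again.
  def sortDistributed(rdd: RDD[Row]): RDD[Row] =
    rdd.map(row => (row._5, row)).sortByKey().values
}
```

The distributed version pays for a shuffle, which is probably why it loses to the local sort on small inputs.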
Upvotes: 1
Reputation: 15074
You can do this with sortBy acting directly on the RDD:
myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple
sortBy also takes optional parameters for the sort order (ascending) and the number of partitions (numPartitions).
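Here is a minimal sketch of both optional parameters in use. The object and method names, the sample rows, the local master setting, and the partition count of 4 are all assumptions for illustration, not part of the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortByFifthField {
  // Hypothetical alias for the tuple shape described in the question.
  type Row = (Double, String, Int, Double, Double)

  // Ascending order is the default, so only the key function is needed.
  def ascendingByFifth(rdd: RDD[Row]): RDD[Row] =
    rdd.sortBy(_._5)

  // Descending order, with the result spread over `parts` partitions.
  def descendingByFifth(rdd: RDD[Row], parts: Int): RDD[Row] =
    rdd.sortBy(_._5, ascending = false, numPartitions = parts)

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders.
    val conf = new SparkConf().setMaster("local[2]").setAppName("sortBy-sketch")
    val sc = new SparkContext(conf)
    try {
      val myRdd = sc.parallelize(Seq(
        (1.0, "a", 1, 0.5, 3.0),
        (2.0, "b", 2, 0.1, 1.0),
        (3.0, "c", 3, 0.9, 2.0)
      ))
      ascendingByFifth(myRdd).collect().foreach(println)
      descendingByFifth(myRdd, 4).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```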
Upvotes: 10