Carter

Reputation: 1623

How to sort an RDD of tuples with 5 elements in Spark Scala?

I have an RDD of 5-element tuples, e.g., RDD[(Double, String, Int, Double, Double)].

How can I sort this RDD efficiently by the fifth element?

I tried mapping this RDD into key-value pairs and using sortByKey, but sortByKey seems quite slow: it is slower than collecting the RDD and calling sortWith on the collected array. Why is that?

Thank you very much.

Upvotes: 6

Views: 8091

Answers (3)

Sivakumar

Reputation: 354

To sort in descending order when the sort key is an Int, you can negate the key with "-".

For example, given an RDD of (String, Int) tuples, this sorts by the 2nd element in descending order:

rdd.sortBy(x => -x._2).collect().foreach(println);

Given an RDD of (String, String) tuples, where the key cannot be negated, pass false as the ascending argument to sort by the 2nd element in descending order:

rdd.sortBy(x => x._2, false).collect().foreach(println);

Upvotes: 3

marios

Reputation: 8996

sortByKey is the only distributed sorting API for Spark 1.0.

How much data are you trying to sort? For a small amount, local (centralized) sorting will be faster. Spark shines when you need to sort many gigabytes of data that may not even fit on a single node.
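For the small-data case, the local sort can be sketched in plain Scala. This is only an illustration: the rows below are hypothetical stand-ins for what rdd.collect() would return, and the object name LocalSort is made up for the example.

```scala
object LocalSort {
  // Hypothetical rows standing in for the result of rdd.collect().
  val collected: Array[(Double, String, Int, Double, Double)] = Array(
    (1.0, "a", 1, 0.5, 3.2),
    (2.0, "b", 2, 0.1, 1.7),
    (3.0, "c", 3, 0.9, 2.4)
  )

  // sortWith takes an explicit comparison, as described in the question.
  val viaSortWith = collected.sortWith(_._5 < _._5)

  // sortBy extracts the key and relies on the implicit Ordering[Double].
  val viaSortBy = collected.sortBy(_._5)
}
```

Both calls run entirely on the driver, so they avoid the shuffle that a distributed sort needs, but they only work when the collected data fits in the driver's memory.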

Upvotes: 1

Shadowlands

Reputation: 15074

You can do this with sortBy acting directly on the RDD:

myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple

sortBy also takes optional parameters for the sort order (ascending, which defaults to true) and the number of partitions (numPartitions).
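A minimal sketch of sortBy with its optional parameters follows. The local master, the app name, the sample rows, and the helper name sortByFifth are all assumptions made for illustration; only sortBy itself and its ascending and numPartitions parameters come from the RDD API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortByFifth {
  type Row = (Double, String, Int, Double, Double)

  // Hypothetical helper: sort a 5-tuple RDD by its fifth field.
  def sortByFifth(rdd: RDD[Row], ascending: Boolean = true): RDD[Row] =
    rdd.sortBy(_._5, ascending = ascending, numPartitions = 2)

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("sort-by-fifth")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data.
      val myRdd: RDD[Row] = sc.parallelize(Seq(
        (1.0, "a", 1, 0.5, 3.2),
        (2.0, "b", 2, 0.1, 1.7),
        (3.0, "c", 3, 0.9, 2.4)
      ))
      sortByFifth(myRdd).collect().foreach(println)                    // ascending
      sortByFifth(myRdd, ascending = false).collect().foreach(println) // descending
    } finally {
      sc.stop()
    }
  }
}
```

Passing ascending = false works for any key type that has an Ordering, so it avoids having to negate numeric keys.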

Upvotes: 10
