How to sort RDD entries using two features simultaneously?

Question

I have a Spark RDD whose entries I want to sort. Let's say each entry is a tuple with 3 elements (name, phonenumber, timestamp). I want to sort the entries first by phonenumber and then by timestamp, without disturbing the order established by phonenumber (so timestamp only re-arranges entries that share the same phonenumber). Is there a Spark function to do this?

(I am using Spark 2.x with Scala)

Neeraj Bhadani · Accepted Answer

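To sort an RDD by multiple elements, you can use the sortBy function with a key function that returns a tuple of those elements. Tuples are compared element by element, so the second element only breaks ties in the first. Below is some sample code in Python; you can implement it similarly in other languages.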

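# Sample (key, value) pairs; `sc` is an existing SparkContext
tmp = [('a', 1), ('a', 2), ('1', 3), ('1', 4), ('2', 5)]

# Sort by x[0], then by x[1]. ascending=False gives descending order; omit it for ascending.
sc.parallelize(tmp).sortBy(lambda x: (x[0], x[1]), ascending=False).collect()
# [('a', 2), ('a', 1), ('2', 5), ('1', 4), ('1', 3)]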
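Applied to the tuple layout from the question (name, phonenumber, timestamp), the same idea might look like the sketch below. The sample records and the helper name by_phone_then_time are made up for illustration, and plain Python sorted() stands in for the RDD so the snippet runs without a cluster; the key function is what would be passed to sortBy.

```python
# Hypothetical sample records in the question's layout: (name, phonenumber, timestamp).
records = [
    ("alice", "555-0102", 30),
    ("bob",   "555-0101", 20),
    ("carol", "555-0102", 10),
    ("dave",  "555-0101", 5),
]

def by_phone_then_time(entry):
    """Composite sort key: phonenumber first, timestamp breaks ties."""
    _name, phone, timestamp = entry
    return (phone, timestamp)

# Plain Python stand-in so the snippet runs without Spark.
sorted_records = sorted(records, key=by_phone_then_time)

# With a SparkContext `sc` available, the same key function can be used as:
#   sc.parallelize(records).sortBy(by_phone_then_time).collect()

for record in sorted_records:
    print(record)
```

Since the question mentions Scala: Scala tuples have an implicit Ordering, so the analogous call there would presumably be rdd.sortBy(e => (e._2, e._3)), assuming both fields have an ordering themselves.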

Regards,

Neeraj
