Mnemosyne

Reputation: 1192

How to sort RDD entries using two features simultaneously?

I have a Spark RDD whose entries I want to sort. Say each entry is a tuple with 3 elements (name, phonenumber, timestamp). I want to sort the entries first by phonenumber and then by timestamp, so that timestamp only orders entries that share the same phonenumber and never disturbs the phonenumber ordering. Is there a Spark function to do this?

(I am using Spark 2.x with Scala)

Upvotes: 0

Views: 864

Answers (2)

koiralo

Reputation: 23119

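You can use the sortBy function on the RDD with a tuple as the sort key. The example below sorts by name and then by timestamp; use (x._2, x._3) as the key to sort by phonenumber and then by timestamp.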

val df = spark.sparkContext.parallelize(Seq(
  ("a","1", "2017-03-10"),
  ("b","12", "2017-03-9"),
  ("b","123", "2015-03-12"),
  ("c","1234", "2015-03-15"),
  ("c","12345", "2015-03-12")
))//.toDF("name", "phonenumber", "timestamp")

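// collect() brings the sorted entries to the driver in order;
// foreach(println) directly on the RDD prints per partition, in no guaranteed order
df.sortBy(x => (x._1, x._3)).collect().foreach(println)

Output:

(a,1,2017-03-10)
(b,123,2015-03-12)
(b,12,2017-03-9)
(c,12345,2015-03-12)
(c,1234,2015-03-15)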
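
Since the question asks for phonenumber first and timestamp second, here is a minimal sketch of that case. It assumes phone numbers are digit-only strings (so they can be converted with toLong and compared numerically, otherwise "5" would sort after "12") and that timestamps are zero-padded yyyy-MM-dd strings (as plain strings, "2017-03-9" sorts after "2017-03-10"). The object name, the helper name and the sample data are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SortByPhoneThenTimestamp {
  // Tuple keys are compared element by element, so the timestamp
  // only decides the order among entries with the same phone number.
  // Assumption: phone numbers contain digits only, timestamps are zero-padded.
  def sortByPhoneThenTimestamp(
      rdd: RDD[(String, String, String)]): RDD[(String, String, String)] =
    rdd.sortBy(entry => (entry._2.toLong, entry._3))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-by-two-keys")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(
      ("a", "12", "2017-03-10"),
      ("b", "12", "2017-03-09"),
      ("b", "5", "2015-03-12"),
      ("c", "5", "2015-03-15")
    ))

    // collect() returns the entries to the driver in sorted order
    sortByPhoneThenTimestamp(rdd).collect().foreach(println)

    spark.stop()
  }
}
```

With this sample data it prints (b,5,2015-03-12), (c,5,2015-03-15), (b,12,2017-03-09), (a,12,2017-03-10), one per line: the two entries with phone number 5 come first, and within each phone number the earlier timestamp comes first.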

If you have a DataFrame instead (created with toDF("name", "phonenumber", "timestamp")), you can simply call:

df.sort("name", "timestamp")
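
On a DataFrame you can also pick a direction per column. The following is only a sketch under a few assumptions: phonenumber is stored as a numeric column, timestamp is a zero-padded yyyy-MM-dd string, and the object name, helper name and sample rows are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SortDataFrameByTwoColumns {
  // phonenumber ascending, then newest timestamp first within each phone number
  def sortByPhoneThenNewest(df: DataFrame): DataFrame =
    df.orderBy(col("phonenumber").asc, col("timestamp").desc)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-dataframe-by-two-columns")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._ // needed for toDF on a Seq of tuples

    val df = Seq(
      ("a", 12L, "2017-03-10"),
      ("b", 12L, "2017-03-09"),
      ("b", 5L, "2015-03-12"),
      ("c", 5L, "2015-03-15")
    ).toDF("name", "phonenumber", "timestamp")

    sortByPhoneThenNewest(df).show()

    spark.stop()
  }
}
```

If both columns should be ascending, df.orderBy("phonenumber", "timestamp") is enough.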

Hope this helps!

Upvotes: 1

Neeraj Bhadani

Reputation: 3110

To sort an RDD on multiple elements, you can use the sortBy function with a tuple as the key. Below is some sample code in Python; it can be implemented similarly in other languages. Note that the second argument, False, sorts in descending order; leave it out for an ascending sort.

tmp = [('a', 1), ('a', 2), ('1', 3), ('1', 4), ('2', 5)]

sc.parallelize(tmp).sortBy(lambda x: (x[0], x[1]), False).collect()
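
For the Scala API the same idea might look like the sketch below. The ascending flag applies to the whole tuple key, so for a mixed direction one option (an assumption that only works when the second element is numeric) is to negate that part of the key. The object and helper names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SortDirectionSketch {
  // Whole key descending: the flag applies to every element of the tuple key.
  def allDescending(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.sortBy(x => (x._1, x._2), ascending = false)

  // Mixed direction: first element ascending, second element descending,
  // achieved by negating the numeric part of the key.
  def firstAscSecondDesc(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.sortBy(x => (x._1, -x._2))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-direction-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    val tmp = spark.sparkContext.parallelize(
      Seq(("a", 1), ("a", 2), ("1", 3), ("1", 4), ("2", 5)))

    println(allDescending(tmp).collect().mkString(", "))
    println(firstAscSecondDesc(tmp).collect().mkString(", "))

    spark.stop()
  }
}
```

The first line printed is (a,2), (a,1), (2,5), (1,4), (1,3) and the second is (1,4), (1,3), (2,5), (a,2), (a,1).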

Regards,

Neeraj

Upvotes: 5
