Reputation: 1192
I have a Spark RDD whose entries I want to sort in an organized manner. Let's say each entry is a tuple with 3 elements: (name, phonenumber, timestamp). I want to sort the entries first by the value of phonenumber and then by the value of timestamp, without disturbing the ordering already established by phonenumber (so timestamp only re-arranges entries within each phonenumber group). Is there a Spark function to do this?
(I am using Spark 2.x with Scala)
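For concreteness, a made-up example of the kind of RDD I mean (the names, numbers and dates are placeholders); two entries share a phone number, so the timestamp should only decide the order within that group:
// Hypothetical sample RDD of (name, phonenumber, timestamp) tuples
val entries = sc.parallelize(Seq(
  ("alice", "555-0101", "2017-01-02"),
  ("bob",   "555-0101", "2017-01-01"),
  ("carol", "555-0199", "2016-12-31")
))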
Upvotes: 0
Views: 864
Reputation: 23119
You can use the sortBy function on an RDD as below:
// An RDD of (name, phonenumber, timestamp) tuples; uncomment toDF to get a DataFrame instead
val df = spark.sparkContext.parallelize(Seq(
  ("a", "1", "2017-03-10"),
  ("b", "12", "2017-03-9"),
  ("b", "123", "2015-03-12"),
  ("c", "1234", "2015-03-15"),
  ("c", "12345", "2015-03-12")
)) //.toDF("name", "phonenumber", "timestamp")

// Sort by name first, then by timestamp within each name, and print on the driver
df.sortBy(x => (x._1, x._3)).collect().foreach(println)
Output:
(a,1,2017-03-10)
(b,123,2015-03-12)
(b,12,2017-03-9)
(c,12345,2015-03-12)
(c,1234,2015-03-15)
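For the ordering asked about in the question (phonenumber first, then timestamp within each phonenumber), the only change is the key function: key on the second and third tuple elements instead. A minimal sketch, reusing the df RDD above:
// Sort by phonenumber first; timestamp only breaks ties within the same phonenumber.
// Both fields are strings here, so this is lexicographic order; parse them if you
// need true numeric or chronological ordering.
df.sortBy(x => (x._2, x._3)).collect().foreach(println)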
If you have a DataFrame created with toDF("name", "phonenumber", "timestamp"), then you can simply do
df.sort("name", "timestamp")
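For completeness, a minimal end-to-end sketch of that DataFrame variant, assuming a SparkSession named spark (peopleDF is just an illustrative name; substitute phonenumber for name if that is the primary sort key you want):
import spark.implicits._

// Build the DataFrame and sort by two columns; the second column only breaks ties
val peopleDF = spark.sparkContext.parallelize(Seq(
  ("a", "1", "2017-03-10"),
  ("b", "12", "2017-03-9"),
  ("c", "1234", "2015-03-15")
)).toDF("name", "phonenumber", "timestamp")

peopleDF.sort("phonenumber", "timestamp").show()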
Hope this helps!
Upvotes: 1
Reputation: 3110
To sort based on multiple elements of an RDD, you can use the sortBy function. Please find below some sample code in Python; you can implement it similarly in other languages as well.
tmp = [('a', 1), ('a', 2), ('1', 3), ('1', 4), ('2', 5)]
# Key on both fields; the second argument (ascending=False) sorts in descending order
sc.parallelize(tmp).sortBy(lambda x: (x[0], x[1]), False).collect()
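For example, the equivalent in Scala (the asker's language) would be roughly:
// Key on both tuple fields; ascending = false gives descending order, as in the Python example
val tmp = Seq(("a", 1), ("a", 2), ("1", 3), ("1", 4), ("2", 5))
sc.parallelize(tmp).sortBy(x => (x._1, x._2), ascending = false).collect()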
Regards,
Neeraj
Upvotes: 5