Garry Steve

Reputation: 129

Spark custom sort column in Java

I have the below result of a Dataset.

1. When a month-year column is present

+------+---------+--------+
| Col1 |  Col2   | NumCol |
+------+---------+--------+
| abc  | jun2018 |     25 |
| aac  | jun2017 |     28 |
| aac  | dec2017 |     30 |
| aac  | apr2018 |     45 |
+------+---------+--------+

When sorting is applied, I get the below result:

+------+---------+--------+
| Col1 |  Col2   | NumCol |
+------+---------+--------+
| aac  | apr2018 |     45 |
| aac  | dec2017 |     30 |
| aac  | jun2017 |     28 |
| abc  | jun2018 |     25 |
+------+---------+--------+

But instead it should have been:

+------+---------+--------+
| Col1 |  Col2   | NumCol |
+------+---------+--------+
| aac  | jun2017 |     28 |
| aac  | dec2017 |     30 |
| aac  | apr2018 |     45 |
| abc  | jun2018 |     25 |
+------+---------+--------+

i.e., with Col2 in chronological order within each Col1 group. How can I achieve this?

2. When a Week column is present

When I have a Week column as below:

+------+-----------------------+--------+
| Col1 |         Week          | NumCol |
+------+-----------------------+--------+
| aac  | 02/04/2018-02/10/2018 |     45 |
| aac  | 02/11/2018-02/17/2018 |     25 |
| aac  | 01/28/2018-02/03/2018 |     30 |
+------+-----------------------+--------+

I want it sorted as below:

+------+-----------------------+--------+
| Col1 |         Week          | NumCol |
+------+-----------------------+--------+
| aac  | 01/28/2018-02/03/2018 |     30 |
| aac  | 02/04/2018-02/10/2018 |     45 |
| aac  | 02/11/2018-02/17/2018 |     25 |
+------+-----------------------+--------+

Here I want to parse the start date of the Week column into a new column dateweek, sort on it, and then drop it before returning the Dataset, roughly as sketched below.

Kind of challenging stuff for me.
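Roughly what I have in mind is the sketch below (Scala; dateweek is just my placeholder name, and I'm assuming Spark 2.2+, where to_date accepts a format string):

import org.apache.spark.sql.functions.{col, split, to_date}

// Take the start date of each week range ("02/04/2018-02/10/2018" -> "02/04/2018"),
// parse it into a proper date, sort on it, then drop the helper column.
val sorted = df
  .withColumn("dateweek", to_date(split(col("Week"), "-").getItem(0), "MM/dd/yyyy"))
  .sort(col("dateweek"))
  .drop("dateweek")

But I am not sure this is the right approach.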

For #1 I followed this, but the issue is that if there are, say, Jan2016, Feb2016, Jan2017, they get sorted as Jan2016, Jan2017, Feb2016 (alphabetically, not chronologically).
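I suspect I need to turn Col2 into a real date before sorting. Something like the sketch below is what I imagine (again assuming Spark 2.2+, whose SimpleDateFormat-based parsing matches month names like "jun" case-insensitively under "MMMyyyy"), but I'm not sure it's correct:

import org.apache.spark.sql.functions.{col, to_date}

// Parse "jun2016" -> 2016-06-01 so the sort is chronological rather than alphabetical.
val ordered = df
  .withColumn("_date", to_date(col("Col2"), "MMMyyyy"))
  .sort(col("Col1"), col("_date"))
  .drop("_date")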

I need help with #2 as well.

Upvotes: 0

Views: 550

Answers (1)

Kishore

Reputation: 5881

Split the Week column and sort based on the start date.

import org.apache.spark.sql.functions.split
import spark.implicits._ // for the $"..." column syntax

// Sort on the start-date string of each week range. NOTE: a lexicographic sort
// on MM/dd/yyyy strings is only chronological within a single year; parse with
// to_date if the data spans years.
df.withColumn("_tmp", split($"Week", "-"))
  .select($"Col1", $"Week", $"NumCol", $"_tmp".getItem(0).as("_sort"))
  .sort("_sort").drop("_sort").show(false)

Output:

+----+---------------------+------+
|Col1|Week                 |NumCol|
+----+---------------------+------+
|aac |01/28/2018-02/03/2018|30    |
|aac |02/04/2018-02/10/2018|45    |
|aac |02/11/2018-02/17/2018|25    |
+----+---------------------+------+

Upvotes: 1
