Reputation: 3460
I have a dataset that looks like this. I want to change all values from the date column to proper date time format. dd-mm-2020 (and preferably also arrange all rows in an ascending order according to the date if possible). How can I achieve this in spark?
Name |Type | date |Value |
ALZA CZ|New | 01/01(FRI) | 0
CLPA CZ|New | 01/01(FRI) | 1
ALZA CZ|Old | 01/02(SAT) | 1
CLPA CZ|Old | 01/02(SAT) | 5
The datasource is already converted into a dataframe:
dataframe = datasource0.toDF()
Upvotes: 0
Views: 1715
Reputation: 1387
First things first, you can use regexp_replace
(Python docs here) to change the String of the Date
column, which takes 3 arguments: the column to operate on, the regular expressions of what you want to match, and what you want to replace the matched text with.
As for the date ascending sort, you need to convert the Date
column into DateType
. For that you can use the to_date
method (Python docs here). However, this will inevitably change your date format to YYYY-MM-DD
("that's because, as you'd guess, in ascending order we first look at sorting by year, then by month, and lastly by day to have 2019 come before 2020, then January come before February, and the 1st of the month to come before its 2nd). To handle this, we simply use the date_format
method (Python docs here) after ordering the rows by the date to specify the desired date format that we want to put out.
To test this, I added some rows in your dataframe
:
+-------+----+----------+-----+
| Name|Type| Date|Value|
+-------+----+----------+-----+
|ALZA CZ| New|01/01(FRI)| 0|
|CLPA CZ| New|01/01(FRI)| 1|
|YYYY YY| Old|01/29(FRI)| 1|
|ALZA CZ| Old|01/02(SAT)| 1|
|XXXX XX| New|03/12(SAT)| 5|
|CLPA CZ| Old|01/02(SAT)| 5|
+-------+----+----------+-----+
And then we use the methods from above to convert Date
to what we need . Here I did the regex matching in two steps, one to replace the /
between month and date to -
and one to replace the text with the parentheses with -2020
. Next, I just converted Date
into a DateType
column (by specifying the current MM-dd-yyyy
date format) and ordered the DataFrame rows accordingly, before I converted Date
one last time for the desired dd-MM-yyyy
String format.
// In Python
dataframe.withColumn("Date", regexp_replace("Date", "/", "-"))
.withColumn("Date", regexp_replace("Date", "\\([a-z]+\\)", "-2020"))
.withColumn("Date", to_date("Date", "MM-dd-yyyy"))
.orderBy("Date")
.withColumn("Date", date_format("Date", "dd-MM-yyyy"))
// In Scala
dataframe.withColumn("Date", regexp_replace(col("Date"), "/", "-"))
.withColumn("Date", regexp_replace(col("Date"), "\\([A-Z]+\\)", "-2020"))
.withColumn("Date", to_date(col("Date"), "MM-dd-yyyy"))
.orderBy("Date")
.withColumn("Date", date_format(col("Date"), "dd-MM-yyyy"))
The manipulated dataframe
now looks like this:
+-------+----+----------+-----+
| Name|Type| Date|Value|
+-------+----+----------+-----+
|CLPA CZ| New|01-01-2020| 1|
|ALZA CZ| New|01-01-2020| 0|
|ALZA CZ| Old|02-01-2020| 1|
|CLPA CZ| Old|02-01-2020| 5|
|YYYY YY| Old|29-01-2020| 1|
|XXXX XX| New|12-03-2020| 5|
+-------+----+----------+-----+
Upvotes: 2