x89
x89

Reputation: 3460

change date format in glue

I have a dataset that looks like this. I want to change all values from the date column to proper date time format. dd-mm-2020 (and preferably also arrange all rows in an ascending order according to the date if possible). How can I achieve this in spark?

Name   |Type        |   date      |Value    |
ALZA CZ|New         | 01/01(FRI)  |     0
CLPA CZ|New         | 01/01(FRI)  |     1
ALZA CZ|Old         | 01/02(SAT)  |     1
CLPA CZ|Old         | 01/02(SAT)  |     5

The datasource is already converted into a dataframe:

dataframe = datasource0.toDF()

Upvotes: 0

Views: 1715

Answers (1)

Coursal
Coursal

Reputation: 1387

First things first, you can use regexp_replace (Python docs here) to change the String of the Date column, which takes 3 arguments: the column to operate on, the regular expressions of what you want to match, and what you want to replace the matched text with.

As for the date ascending sort, you need to convert the Date column into DateType. For that you can use the to_date method (Python docs here). However, this will inevitably change your date format to YYYY-MM-DD ("that's because, as you'd guess, in ascending order we first look at sorting by year, then by month, and lastly by day to have 2019 come before 2020, then January come before February, and the 1st of the month to come before its 2nd). To handle this, we simply use the date_format method (Python docs here) after ordering the rows by the date to specify the desired date format that we want to put out.

To test this, I added some rows in your dataframe:

+-------+----+----------+-----+
|   Name|Type|      Date|Value|
+-------+----+----------+-----+
|ALZA CZ| New|01/01(FRI)|    0|
|CLPA CZ| New|01/01(FRI)|    1|
|YYYY YY| Old|01/29(FRI)|    1|
|ALZA CZ| Old|01/02(SAT)|    1|
|XXXX XX| New|03/12(SAT)|    5|
|CLPA CZ| Old|01/02(SAT)|    5|
+-------+----+----------+-----+

And then we use the methods from above to convert Date to what we need . Here I did the regex matching in two steps, one to replace the / between month and date to - and one to replace the text with the parentheses with -2020. Next, I just converted Date into a DateType column (by specifying the current MM-dd-yyyy date format) and ordered the DataFrame rows accordingly, before I converted Date one last time for the desired dd-MM-yyyy String format.

// In Python
dataframe.withColumn("Date", regexp_replace("Date", "/", "-"))
          .withColumn("Date", regexp_replace("Date", "\\([a-z]+\\)", "-2020"))
          .withColumn("Date", to_date("Date", "MM-dd-yyyy"))
          .orderBy("Date")
          .withColumn("Date", date_format("Date", "dd-MM-yyyy"))


// In Scala
dataframe.withColumn("Date", regexp_replace(col("Date"), "/", "-"))
          .withColumn("Date", regexp_replace(col("Date"), "\\([A-Z]+\\)", "-2020"))
          .withColumn("Date", to_date(col("Date"), "MM-dd-yyyy"))
          .orderBy("Date")
          .withColumn("Date", date_format(col("Date"), "dd-MM-yyyy"))

The manipulated dataframe now looks like this:

+-------+----+----------+-----+
|   Name|Type|      Date|Value|
+-------+----+----------+-----+
|CLPA CZ| New|01-01-2020|    1|
|ALZA CZ| New|01-01-2020|    0|
|ALZA CZ| Old|02-01-2020|    1|
|CLPA CZ| Old|02-01-2020|    5|
|YYYY YY| Old|29-01-2020|    1|
|XXXX XX| New|12-03-2020|    5|
+-------+----+----------+-----+

Upvotes: 2

Related Questions