Reputation: 988
I am trying to convert a pyspark column of string type to date type as below.
**Date**
31 Mar 2020
2 Apr 2020
29 Jan 2019
8 Sep 2109
Output required:
31-03-2020
02-04-2020
29-01-2019
08-04-2109
Thanks.
Upvotes: 2
Views: 587
Reputation: 31490
You can use dayofmonth,year,month
(or) date_format()
(or) from_unixtime(unix_timestamp())
in built functions for this case.
Example:
#sample data
df=spark.createDataFrame([("31 Mar 2020",),("2 Apr 2020",),("29 Jan 2019",)],["Date"])
#DataFrame[Date: string]
df.show()
#+-----------+
#| Date|
#+-----------+
#|31 Mar 2020|
#| 2 Apr 2020|
#|29 Jan 2019|
#+-----------+
from pyspark.sql.functions import *
df.withColumn("new_dt", to_date(col("Date"),"dd MMM yyyy")).\
withColumn("year",year(col("new_dt"))).\
withColumn("month",month(col("new_dt"))).\
withColumn("day",dayofmonth(col("new_dt"))).\
show()
#+-----------+----------+----+-----+---+
#| Date| new_dt|year|month|day|
#+-----------+----------+----+-----+---+
#|31 Mar 2020|2020-03-31|2020| 3| 31|
#| 2 Apr 2020|2020-04-02|2020| 4| 2|
#|29 Jan 2019|2019-01-29|2019| 1| 29|
#+-----------+----------+----+-----+---+
#using date_format
df.withColumn("new_dt", to_date(col("Date"),"dd MMM yyyy")).\
withColumn("year",date_format(col("new_dt"),"yyyy")).\
withColumn("month",date_format(col("new_dt"),"MM")).\
withColumn("day",date_format(col("new_dt"),"dd")).show()
#+-----------+----------+----+-----+---+
#| Date| new_dt|year|month|day|
#+-----------+----------+----+-----+---+
#|31 Mar 2020|2020-03-31|2020| 03| 31|
#| 2 Apr 2020|2020-04-02|2020| 04| 02|
#|29 Jan 2019|2019-01-29|2019| 01| 29|
#+-----------+----------+----+-----+---+
Upvotes: 2
Reputation: 8410
The to_date
function would need days as 02
or ' 2'
instead of 2
. Therefore, we can use regex
to remove spaces, then wherever the length
of the string is less than
the max(9)
, we can add 0 to the start
of the string. Then we can apply to_date
and use it to extract your other columns(day,month,year). Can also use date_format
to keep your date in a specified format.
df.show()#sample df
+-----------+
| Date|
+-----------+
|31 Mar 2020|
|2 Apr 2020|
|29 Jan 2019|
|8 Sep 2019|
+-----------+
from pyspark.sql import functions as F
df.withColumn("regex", F.regexp_replace("Date","\ ",""))\
.withColumn("Date", F.when(F.length("regex")<9, F.concat(F.lit(0),F.col("regex")))\
.otherwise(F.col("regex"))).drop("regex")\
.withColumn("Date", F.to_date("Date",'ddMMMyyyy'))\
.withColumn("Year", F.year("Date"))\
.withColumn("Month",F.month("Date"))\
.withColumn("Day", F.dayofmonth("Date"))\
.withColumn("Date_Format2", F.date_format("Date", 'dd-MM-yyyy'))\
.show()
#output
+----------+----+-----+---+------------+
| Date|Year|Month|Day|Date_Format2|
+----------+----+-----+---+------------+
|2020-03-31|2020| 3| 31| 31-03-2020|
|2020-04-02|2020| 4| 2| 02-04-2020|
|2019-01-29|2019| 1| 29| 29-01-2019|
|2019-09-08|2019| 9| 8| 08-09-2019|
+----------+----+-----+---+------------+
Upvotes: 1