Lilly
Lilly

Reputation: 988

Pyspark column: convert string to datetype

I am trying to convert a pyspark column of string type to date type as below.

**Date**
31 Mar 2020
2  Apr 2020
29 Jan 2019
8  Sep 2109

Output required:

31-03-2020
02-04-2020
29-01-2019
08-04-2109  

Thanks.

Upvotes: 2

Views: 587

Answers (2)

notNull
notNull

Reputation: 31490

You can use dayofmonth,year,month (or) date_format() (or) from_unixtime(unix_timestamp()) in built functions for this case.

Example:

#sample data
df=spark.createDataFrame([("31 Mar 2020",),("2 Apr 2020",),("29 Jan 2019",)],["Date"])
#DataFrame[Date: string]
df.show()
#+-----------+
#|       Date|
#+-----------+
#|31 Mar 2020|
#| 2 Apr 2020|
#|29 Jan 2019|
#+-----------+

from pyspark.sql.functions import *

df.withColumn("new_dt", to_date(col("Date"),"dd MMM yyyy")).\
withColumn("year",year(col("new_dt"))).\
withColumn("month",month(col("new_dt"))).\
withColumn("day",dayofmonth(col("new_dt"))).\
show()

#+-----------+----------+----+-----+---+
#|       Date|    new_dt|year|month|day|
#+-----------+----------+----+-----+---+
#|31 Mar 2020|2020-03-31|2020|    3| 31|
#| 2 Apr 2020|2020-04-02|2020|    4|  2|
#|29 Jan 2019|2019-01-29|2019|    1| 29|
#+-----------+----------+----+-----+---+

#using date_format
df.withColumn("new_dt", to_date(col("Date"),"dd MMM yyyy")).\
withColumn("year",date_format(col("new_dt"),"yyyy")).\
withColumn("month",date_format(col("new_dt"),"MM")).\
withColumn("day",date_format(col("new_dt"),"dd")).show()

#+-----------+----------+----+-----+---+
#|       Date|    new_dt|year|month|day|
#+-----------+----------+----+-----+---+
#|31 Mar 2020|2020-03-31|2020|   03| 31|
#| 2 Apr 2020|2020-04-02|2020|   04| 02|
#|29 Jan 2019|2019-01-29|2019|   01| 29|
#+-----------+----------+----+-----+---+

Upvotes: 2

murtihash
murtihash

Reputation: 8410

The to_date function would need days as 02 or ' 2' instead of 2. Therefore, we can use regex to remove spaces, then wherever the length of the string is less than the max(9), we can add 0 to the start of the string. Then we can apply to_date and use it to extract your other columns(day,month,year). Can also use date_format to keep your date in a specified format.

df.show()#sample df
+-----------+
|       Date|
+-----------+
|31 Mar 2020|
|2  Apr 2020|
|29 Jan 2019|
|8  Sep 2019|
+-----------+

from pyspark.sql import functions as F    
df.withColumn("regex", F.regexp_replace("Date","\ ",""))\
  .withColumn("Date", F.when(F.length("regex")<9, F.concat(F.lit(0),F.col("regex")))\
              .otherwise(F.col("regex"))).drop("regex")\
  .withColumn("Date", F.to_date("Date",'ddMMMyyyy'))\
  .withColumn("Year", F.year("Date"))\
  .withColumn("Month",F.month("Date"))\
  .withColumn("Day", F.dayofmonth("Date"))\
  .withColumn("Date_Format2", F.date_format("Date", 'dd-MM-yyyy'))\
  .show()

#output
+----------+----+-----+---+------------+
|      Date|Year|Month|Day|Date_Format2|
+----------+----+-----+---+------------+
|2020-03-31|2020|    3| 31|  31-03-2020|
|2020-04-02|2020|    4|  2|  02-04-2020|
|2019-01-29|2019|    1| 29|  29-01-2019|
|2019-09-08|2019|    9|  8|  08-09-2019|
+----------+----+-----+---+------------+

Upvotes: 1

Related Questions