Aram Maliachi

Reputation: 223

String date column as %d/%m/%Y to date column in pyspark

I have a spark df as follows:

+----------+-----------+
|      date|impressions|
+----------+-----------+
|22/04/2020|     136821|
|23/04/2020|     159688|
|24/04/2020|     165053|
|25/04/2020|     165609|
|26/04/2020|     183574|
+----------+-----------+

Where the date column is of type string, formatted as %d/%m/%Y. I need that same column converted to a Spark-supported date format and cast to type date.
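One source of confusion here (an aside, not from the question itself): %d/%m/%Y is a Python strftime-style pattern, while Spark's to_date() expects a Java SimpleDateFormat-style pattern, whose equivalent is dd/MM/yyyy. A minimal plain-Python sketch of the same parse, using the strftime notation, no Spark required:

```python
from datetime import datetime, date

def parse_ddmmyyyy(s: str) -> date:
    # %d/%m/%Y is strftime notation; the Spark to_date() equivalent
    # pattern is "dd/MM/yyyy" (Java SimpleDateFormat style).
    return datetime.strptime(s, "%d/%m/%Y").date()

print(parse_ddmmyyyy("22/04/2020"))  # 2020-04-22
```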

Upvotes: 1

Views: 640

Answers (3)

stanleywxc

Reputation: 26

If you don't want to create a new column but just change the original column's data type, you can reuse the original column name in withColumn():

from pyspark.sql.functions import to_date, col

changed_df = df.withColumn("date", to_date(col("date"), "dd/MM/yyyy"))
changed_df.printSchema()
#root
# |-- date: date (nullable = true)
# |-- impressions: integer (nullable = true)
changed_df.show(10, False)
#+----------+-----------+
#|date      |impressions|
#+----------+-----------+
#|2020-04-22|136821     |
#|2020-04-23|159688     |
#|2020-04-24|165053     |
#|2020-04-25|165609     |
#|2020-04-26|183574     |
#+----------+-----------+
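One behavioral difference worth keeping in mind (my note, not part of the answer above): Spark's to_date() typically yields null for strings that don't match the pattern, whereas Python's strptime raises ValueError. A small stdlib sketch that mimics that null-on-failure semantics:

```python
from datetime import datetime, date
from typing import Optional

def to_date_or_none(s: str, fmt: str = "%d/%m/%Y") -> Optional[date]:
    # Mimic Spark's to_date(): return None instead of raising on bad input.
    try:
        return datetime.strptime(s, fmt).date()
    except ValueError:
        return None

print(to_date_or_none("23/04/2020"))  # 2020-04-23
print(to_date_or_none("2020-04-23"))  # None (pattern mismatch)
```

This is handy for spotting malformed rows before conversion: after to_date(), any row whose date column is null failed to parse.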

Upvotes: 0

notNull

Reputation: 31540

Use any of the to_date(), to_timestamp(), or from_unixtime(unix_timestamp()) functions for this case.

Example:

df.show()
#+----------+-----------+
#|      date|impressions|
#+----------+-----------+
#|22/04/2020|     136821|
#+----------+-----------+
from pyspark.sql.functions import *
from pyspark.sql.types import *
df.withColumn("dt",to_date(col("date"),"dd/MM/yyyy")).\
withColumn("dt1",to_timestamp(col("date"),"dd/MM/yyyy").cast("date")).\
withColumn("dt2",from_unixtime(unix_timestamp(col("date"),"dd/MM/yyyy")).cast("date")).\
show()
#+----------+-----------+----------+----------+----------+
#|      date|impressions|        dt|       dt1|       dt2|
#+----------+-----------+----------+----------+----------+
#|22/04/2020|     136821|2020-04-22|2020-04-22|2020-04-22|
#+----------+-----------+----------+----------+----------+

df.withColumn("dt",to_date(col("date"),"dd/MM/yyyy")).\
withColumn("dt1",to_timestamp(col("date"),"dd/MM/yyyy").cast("date")).\
withColumn("dt2",from_unixtime(unix_timestamp(col("date"),"dd/MM/yyyy")).cast("date")).\
printSchema()
#root
# |-- date: string (nullable = true)
# |-- impressions: string (nullable = true)
# |-- dt: date (nullable = true)
# |-- dt1: date (nullable = true)
# |-- dt2: date (nullable = true)

Upvotes: 2

Jonathan Besomi

Reputation: 337

from pyspark.sql.functions import unix_timestamp, from_unixtime
# unix_timestamp() expects a Java SimpleDateFormat pattern ("dd/MM/yyyy"),
# not a strftime pattern ("%d/%m/%Y"); from_unixtime() returns a string,
# so cast to date to get the requested type.
df.select(
    from_unixtime(unix_timestamp('date', 'dd/MM/yyyy')).cast('date').alias('date'),
    'impressions'
)

Upvotes: 0
