Reputation: 223
I have a spark df as follows:
+----------+-----------+
| date|impressions|
+----------+-----------+
|22/04/2020| 136821|
|23/04/2020| 159688|
|24/04/2020| 165053|
|25/04/2020| 165609|
|26/04/2020| 183574|
+----------+-----------+
where the date column is of type string, formatted as %d/%m/%Y. I need that same column to be converted to a date format supported by Spark and to be of type date.
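For reference, a minimal way to reproduce this df (assuming an existing SparkSession named spark; impressions may also arrive as a string depending on the source):
df = spark.createDataFrame(
    [("22/04/2020", 136821),
     ("23/04/2020", 159688),
     ("24/04/2020", 165053),
     ("25/04/2020", 165609),
     ("26/04/2020", 183574)],
    ["date", "impressions"],
)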
Upvotes: 1
Views: 640
Reputation: 26
If you don't want to create a new column and just want to change the original column's data type, you can reuse the original column name in withColumn():
from pyspark.sql.functions import col, to_date

changed_df = df.withColumn("date", to_date(col("date"), "dd/MM/yyyy"))
changed_df.printSchema()
#root
# |-- date: date (nullable = true)
# |-- impressions: integer (nullable = true)
changed_df.show(10, False)
+----------+-----------+
|date      |impressions|
+----------+-----------+
|2020-04-22|136821     |
|2020-04-23|159688     |
|2020-04-24|165053     |
|2020-04-25|165609     |
|2020-04-26|183574     |
+----------+-----------+
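Once the column is a real date type, date comparisons and date functions behave as expected. A small illustrative sketch, assuming changed_df and the imports from above:
# keeps only the rows from 2020-04-24 onward; the string literal is
# compared against the date column directly
changed_df.filter(col("date") >= "2020-04-24").show()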
Upvotes: 0
Reputation: 31540
Use either to_date, to_timestamp, or from_unixtime(unix_timestamp()) for this case.
Example:
df.show()
#+----------+-----------+
#| date|impressions|
#+----------+-----------+
#|22/04/2020| 136821|
#+----------+-----------+
from pyspark.sql.functions import *
from pyspark.sql.types import *
df.withColumn("dt",to_date(col("date"),"dd/MM/yyyy")).\
withColumn("dt1",to_timestamp(col("date"),"dd/MM/yyyy").cast("date")).\
withColumn("dt2",from_unixtime(unix_timestamp(col("date"),"dd/MM/yyyy")).cast("date")).\
show()
#+----------+-----------+----------+----------+----------+
#| date|impressions| dt| dt1| dt2|
#+----------+-----------+----------+----------+----------+
#|22/04/2020| 136821|2020-04-22|2020-04-22|2020-04-22|
#+----------+-----------+----------+----------+----------+
df.withColumn("dt",to_date(col("date"),"dd/MM/yyyy")).\
withColumn("dt1",to_timestamp(col("date"),"dd/MM/yyyy").cast("date")).\
withColumn("dt2",from_unixtime(unix_timestamp(col("date"),"dd/MM/yyyy")).cast("date")).\
printSchema()
#root
# |-- date: string (nullable = true)
# |-- impressions: string (nullable = true)
# |-- dt: date (nullable = true)
# |-- dt1: date (nullable = true)
# |-- dt2: date (nullable = true)
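The same conversion can also be written with SQL expression syntax if you prefer; a minimal sketch, assuming the same df:
df.selectExpr("to_date(date, 'dd/MM/yyyy') AS date", "impressions").printSchema()
#root
# |-- date: date (nullable = true)
# |-- impressions: string (nullable = true)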
Upvotes: 2
Reputation: 337
from pyspark.sql.functions import unix_timestamp, from_unixtime

# Spark uses Java-style datetime patterns, so the format is 'dd/MM/yyyy' rather
# than '%d/%m/%Y'; casting to date gives a proper date column instead of a
# timestamp string
df.select(
    from_unixtime(unix_timestamp('date', 'dd/MM/yyyy')).cast('date').alias('date')
)
Upvotes: 0